Anti-Spam Image Filtering Technologies
DESCRIPTION
Slides from my wildly popular presentation at HP World 2005. Who knew? Grossly over-simplified signal processing methodology and sample photos of models in bikinis were a winning combo, even in San Francisco.
TRANSCRIPT
Image-Filtering Technologies
Michael Lamont
Senior Software Engineer
Process Software
Overview
• Role of image filtering in anti-spam
filtering
• Two popular image filtering methods:
– Shape recognition
– Skin detection
• Example image filtering
• Image filtering issues
• Tools you can play with on your own
What Isn’t Covered
• Anything requiring advanced math
• Optical character recognition (OCR)
Spam Images
• A picture is worth 1000 words…
• …and it’s a lot harder to filter than
1000 words.
• Photos are essential marketing tools,
especially when spamvertizing
pornography.
Spam Images
• Right now, a spam filter can be very
effective without looking at images.
• This is going to change when the
majority of sites start installing more
accurate filters, and spammers are
forced to adapt.
90-Second Image Review
• To understand how image filtering
technologies work, you need a basic
understanding of how computers
represent images.
• Images are broken into square dots,
which correspond to pixels on a
monitor.
90-Second Image Review
• Example image:
90-Second Image Review
• Each dot’s color is represented by 3
components: red, green, and blue.
• Each of the three color components
has a value of 0 to 255.
• If all three are 0, then the pixel is black.
If all three are 255, then the pixel is
white.
90-Second Image Review
• The higher the number, the more
intense the color component.
• Example: Increasing red value from 0
to 255 while leaving other components
at 0:
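The representation described above can be sketched in a few lines. This is only an illustration: the `red_ramp` helper and the constant names are made up for this example and are not from the talk.

```python
# Each pixel is a (red, green, blue) tuple; every component ranges from 0 to 255.
BLACK = (0, 0, 0)        # all three components at 0
WHITE = (255, 255, 255)  # all three components at 255

# Number of distinct colors this representation allows.
TOTAL_COLORS = 256 ** 3


def red_ramp(steps):
    """Return `steps` pixels going from black to full red (steps must be >= 2).

    Green and blue stay at 0, so only the red intensity changes.
    """
    return [(round(255 * i / (steps - 1)), 0, 0) for i in range(steps)]


ramp = red_ramp(6)
print(ramp)
print(TOTAL_COLORS)
```

The first pixel in the ramp is black and the last is the most intense red the format can express.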
Shape Recognition
• Identifies objects in an image using
posterization and edge finding.
• Extracts interesting objects and
searches for similar objects in a
database of “bad” objects.
• For our application, the objects are
human body parts.
Posterization
• Dramatically reduces the number of
colors in an image.
• Has the side effect of lumping most of
an object’s pixels together.
• Called “posterization” because the
same kind of color reduction used to
be done for images printed on posters.
Posterization - Example
Posterization - Method
• A number of color bins are created.
• The number of bins is far smaller than
the ~16.7 million colors (256 × 256 × 256)
that are possible.
• Each bin holds several hundred colors
that are closely related.
• Every color in the bin is represented by
the bin’s average color.
Posterization - Method
• Example: If a bin contained every
shade of red from light pink to dark
blood, every color in the bin would be
represented by plain old red.
• The posterization process itself
consists of replacing the color of every
pixel in the image with its bin’s
representative color.
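A minimal sketch of the method above, assuming the simplest possible bin layout: each color channel is cut into equal ranges, and the middle of each range stands in for the bin's average. The function names and the `levels` parameter are hypothetical; real posterizers may choose their bins quite differently.

```python
def posterize_pixel(pixel, levels=4):
    """Map one (r, g, b) pixel to the representative color of its bin.

    Each channel is split into `levels` equal ranges, giving levels**3 bins.
    The representative value is the midpoint of the range, which is roughly
    the average of the channel values that the range covers.
    """
    bin_size = 256 // levels
    result = []
    for channel in pixel:
        # Clamp the index so 255 stays in the last bin when 256 is not
        # evenly divisible by `levels`.
        index = min(channel // bin_size, levels - 1)
        result.append(index * bin_size + bin_size // 2)
    return tuple(result)


def posterize(image, levels=4):
    """Replace every pixel in a list-of-rows image with its bin's color."""
    return [[posterize_pixel(p, levels) for p in row] for row in image]


# Two slightly different reds land in the same bin and become one color.
image = [[(200, 30, 40), (220, 10, 20)]]
poster = posterize(image)
print(poster)
```

With `levels=4` there are only 64 possible output colors, so neighboring pixels of an object tend to collapse into the same flat patch.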
Posterization - Example 2
Posterization - Example 3
Edge Finding
• After posterizing the image, edge
finding is used to identify individual
objects.
• Edge finding determines the
boundaries between different patches
of color and contrast.
Edge Finding - Example
Edge Finding - Method
• The edge finding program scans the
image looking for pixels that are very
different from their neighbors.
• When it finds a radically different pixel,
it marks it as part of an edge.
• Good edge finding algorithms look at
lots of neighboring pixels to help
reduce noise.
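The scan described above might look like the following sketch. It is a deliberately naive detector, not one of the standard operators used in production; the `find_edges` name and the `threshold` value are assumptions made for this example.

```python
def find_edges(gray, threshold=50):
    """Mark pixels whose brightness differs sharply from their neighbors.

    `gray` is a list of rows of 0-255 brightness values. A pixel is marked
    as an edge (1) when it differs from the average of its surrounding
    pixels (up to 8, fewer at the image border) by more than `threshold`.
    Averaging over several neighbors is what damps single-pixel noise.
    """
    height, width = len(gray), len(gray[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            neighbors = [
                gray[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            if neighbors and abs(gray[y][x] - sum(neighbors) / len(neighbors)) > threshold:
                edges[y][x] = 1
    return edges


# A dark patch on the left and a bright patch on the right.
picture = [[0, 0, 0, 200, 200, 200] for _ in range(4)]
edge_map = find_edges(picture)
for row in edge_map:
    print(row)
```

Only the two columns where the dark and bright patches meet get marked; the interior of each patch stays clear.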
Edge Finding - Demonstration
Edge Finding - Example 2
Edge Finding - Example 3
Object Extraction
• Once objects have been identified with
posterization and edge finding, they’re
easy to extract.
Object Extraction
• For people wearing swimsuits, the filter
searches for leg, midriff, and upper
torso objects.
Object Extraction
• A database of known objects is
searched for matches to the extracted
objects.
• Both object shape and color are used
in the search.
• Comparisons are done with a fuzzy
logic algorithm, since it’s unlikely two
objects will be exactly alike.
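To give a feel for inexact matching, here is a sketch that swaps in a plain tolerance-based similarity score for the fuzzy logic comparison mentioned above. Everything in it is an assumption for illustration: the `ExtractedObject` fields, the equal weighting of the three differences, and the `min_score` cutoff.

```python
import math
from dataclasses import dataclass


@dataclass
class ExtractedObject:
    label: str
    aspect_ratio: float  # bounding box width / height (shape feature)
    fill_ratio: float    # fraction of the bounding box the object covers
    mean_color: tuple    # average (r, g, b) of the object's pixels


MAX_COLOR_DISTANCE = math.dist((0, 0, 0), (255, 255, 255))


def similarity(a, b):
    """Score from 0.0 (nothing alike) to 1.0 (identical) using shape and color."""
    shape_diff = abs(a.aspect_ratio - b.aspect_ratio) / max(a.aspect_ratio, b.aspect_ratio)
    fill_diff = abs(a.fill_ratio - b.fill_ratio)
    color_diff = math.dist(a.mean_color, b.mean_color) / MAX_COLOR_DISTANCE
    return 1.0 - (shape_diff + fill_diff + color_diff) / 3


def best_match(candidate, database, min_score=0.85):
    """Return the most similar known object, or None if nothing is close enough."""
    best, best_score = None, 0.0
    for known in database:
        score = similarity(candidate, known)
        if score > best_score:
            best, best_score = known, score
    return best if best_score >= min_score else None


database = [
    ExtractedObject("leg", 0.30, 0.70, (224, 172, 105)),
    ExtractedObject("ball", 1.00, 0.78, (30, 60, 200)),
]
candidate = ExtractedObject("unknown", 0.33, 0.65, (215, 165, 100))
match = best_match(candidate, database)
print(match.label if match else "no match")
```

The candidate is not identical to any stored object, but it is close enough in both shape and color to be paired with the nearest one.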
Skin Detection
• Subset of an image classification
method called color histogram
matching.
• Finds patches of skin tone in an image.
• Calculates the overall percentage of
the image that is skin.
• If more than a specified amount of the
image is skin, it’s filtered.
Skin Tones
• Almost all human skin is the same hue
- saturation differences result in
different skin colors.
• Human skin tones don’t often appear
in other photographed objects, so color
alone can be used to identify skin.
• Skin tones are dominated by red, with
less green and the least blue.
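The hue claim can be checked with Python's standard `colorsys` module. The three RGB values below are illustrative picks for light, medium, and dark skin, not measurements from the talk.

```python
import colorsys

# Hypothetical sample colors chosen for this example.
samples = {
    "light": (255, 224, 189),
    "medium": (224, 172, 105),
    "dark": (141, 85, 36),
}


def hue_saturation(pixel):
    """Return (hue in degrees, saturation from 0 to 1) for an (r, g, b) pixel."""
    r, g, b = (channel / 255 for channel in pixel)
    h, s, _v = colorsys.rgb_to_hsv(r, g, b)
    return h * 360, s


results = {name: hue_saturation(rgb) for name, rgb in samples.items()}
for name, (hue, sat) in results.items():
    print(f"{name:6s} hue={hue:5.1f} deg  saturation={sat:.2f}")
```

For these samples the hues all sit within a few degrees of 30 (an orange-red), while saturation runs from roughly 0.26 to 0.74.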
Skin Color Model
• To identify skin tones in an image, a
filter needs to know what colors are
skin tones.
• You could hardcode every skin color,
but there are tens of thousands of
them.
• It’s much more accurate to mark skin
patches in sample images and “train”
the filter on them.
Skin Color Training
• Works almost like Bayesian filter
training, but with image colors instead
of message tokens.
• Filter maintains one database of skin
colors, and another database of non-
skin colors.
• If a color appears more often in the
skin color database than in the non-skin
database, it’s treated as a skin color.
Skin Color Training
• This system has the nice side-effect of
dropping out most skin colors that also
appear in non-skin areas of photos.
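A minimal sketch of the two-database idea, assuming the databases are simple counters keyed by color. The `SkinColorModel` class and the `quantize` step (which groups near-identical colors so a few samples generalize) are assumptions of this example, not details from the talk.

```python
from collections import Counter


def quantize(pixel, step=8):
    """Group nearby colors together by dropping the low bits of each channel."""
    return tuple(channel // step for channel in pixel)


class SkinColorModel:
    def __init__(self):
        self.skin = Counter()      # how often each color was seen in skin patches
        self.non_skin = Counter()  # how often each color was seen elsewhere

    def train(self, pixel, is_skin):
        """Record one labeled pixel from a training image."""
        database = self.skin if is_skin else self.non_skin
        database[quantize(pixel)] += 1

    def is_skin_color(self, pixel):
        """A color counts as skin only if it was seen more often in skin patches."""
        key = quantize(pixel)
        return self.skin[key] > self.non_skin[key]


model = SkinColorModel()
for _ in range(3):
    model.train((224, 172, 105), is_skin=True)   # marked skin patch
model.train((224, 172, 105), is_skin=False)      # same color once in the background
model.train((210, 180, 140), is_skin=True)       # ambiguous color seen once as skin...
for _ in range(4):
    model.train((210, 180, 140), is_skin=False)  # ...but mostly in non-skin areas

print(model.is_skin_color((226, 174, 107)))  # close to the trained skin color
print(model.is_skin_color((210, 180, 140)))  # dropped: mostly non-skin
```

The second lookup shows the side effect described above: a color that also turns up often in non-skin areas stops counting as skin.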
Training Sample
Skin Identification
• To analyze an image, the filter
examines the color of each pixel.
• If the color is a skin tone, the filter
marks the pixel as skin.
• When every pixel has been examined,
the percentage of the image that is skin
is calculated.
• If the percentage is over a specified
threshold, the image is filtered.
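The steps above can be sketched as follows. The 30% threshold and the `looks_like_skin` rule are placeholders invented for this example; a real filter would plug in its trained skin color model and a tuned threshold.

```python
def skin_fraction(image, is_skin):
    """Return the fraction (0.0 to 1.0) of pixels that `is_skin` marks as skin."""
    pixels = [pixel for row in image for pixel in row]
    skin_pixels = sum(1 for pixel in pixels if is_skin(pixel))
    return skin_pixels / len(pixels)


def should_filter(image, is_skin, threshold=0.30):
    """Filter the image when more than `threshold` of it is skin."""
    return skin_fraction(image, is_skin) > threshold


def looks_like_skin(pixel):
    """Stand-in test (an assumption): strong red, with red > green > blue."""
    r, g, b = pixel
    return r > 95 and r > g > b


SKIN = (224, 172, 105)
SKY = (90, 140, 220)
image = [
    [SKIN, SKIN, SKY, SKY, SKY],
    [SKIN, SKIN, SKY, SKY, SKY],
]

fraction = skin_fraction(image, looks_like_skin)
print(f"{fraction:.0%} skin, filtered: {should_filter(image, looks_like_skin)}")
```

Because the decision depends only on a pixel count, the filter has no idea what the skin-colored region actually is.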
Skin Detection Example
Correctly Filtered Images - Shape
Correctly Filtered Images - Skin
[Image slides: a series of example images that are filtered correctly, each shown with its shape recognition and skin detection results]
Shape Recognition Problems
• Following are examples of images that
shape recognition doesn’t handle
correctly.
• Skin detection handles them correctly,
but only because it’s biased to filter
images with a lot of skin.
Shape Recognition Problems
• Unusual angle obscures shapes
• Skin detection works
Shape Recognition Problems
• Shapes are too broken up for the filter
to work
• Skin detection works
Shape Recognition Problems
• Not enough “swimsuit” objects
• Skin detection works
Shape Recognition Problems
• Not enough “swimsuit” objects
• Skin detection works
Shape Recognition Problems
• Image is so noisy that edge detection
goes crazy
• Amazingly, skin detection still works
Skin Detection Problems
• Following are examples of images that
skin detection incorrectly filters.
• Shape recognition works for most of
these, mainly because it can’t extract
any useful shapes.
Skin Detection Problems
• Baby photos tend to show lots of skin
• Shape recognition doesn’t filter the
image
Skin Detection Problems
• Portraits have the same problem as
babies.
• Shape recognition ignores the image.
Skin Detection Problems
• In the right light, sand can be the same
color as skin.
• That’s fairly rare - usually skin color
models exclude sand colors.
Skin Detection Problems
• Black & white images can’t be filtered
• It also makes life rough on shape
recognition filters.
Wedding Photos
• Wedding photos are guaranteed to
make a mess of image filters.
• Skin fades into the background
because of soft lighting, soft filters, and
retouching.
• Turns out that brides get upset if the
image is crystal clear with good
contrast - it shows off skin flaws.
Wedding Photos
• Skin detection filters start identifying
everything as skin (false positive).
• Shape recognition filters give up and
don’t filter the message (accurate, but
not for the right reasons).
• Porn tends not to be shot with soft
lighting - good contrast makes skin
“pop” in photos.
Example Wedding Photo - Shape
Example Wedding Photo - Skin
[Image slides: example wedding photos, each shown with its shape recognition and skin detection results]
“Art Porn”
• Usually shot with the same lighting
effects as wedding photos.
• Rarely seen in email.
• In this case, skin detection is accurate
for the wrong reasons while shape
recognition lets the image pass.
“Art Porn” Example - Shape
“Art Porn” Example - Skin
[Image slides: an example image shown with its shape recognition and skin detection results]
Things I Can’t Show You
• S & M
– Skin tends to be covered with “clothing”
– Shapes are broken up by all of the
paraphernalia
• Simpson’s shocker
• Still images from “interesting” videos
– Images are badly pixelated
– Colors are muddy and smudged
Image Filtering Issues
• Accuracy:
– Shape recognition misses lots of images it
shouldn’t (false negatives)
– Skin detection filters lots of images it
shouldn’t (false positives)
– Best skin detection systems are about
80% accurate
– Best shape recognition systems are about
40% accurate
Image Filtering Issues
• Performance:
– Image filtering requires huge amounts of
memory, CPU time, and disk bandwidth.
– Unacceptably slows down most sites’
email servers/filtering systems.
– DL380 benchmark:
• ~1.2 million messages/hour with no filtering
• ~195,000 messages/hour with skin detection
• ~69,000 messages/hour with shape recognition
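To put the benchmark numbers in perspective, this short calculation turns the quoted throughput figures into an average time per message and a slowdown factor. Only the three throughput values come from the slide; the variable names are made up here.

```python
# Messages per hour, as quoted in the benchmark above.
throughput = {
    "no filtering": 1_200_000,
    "skin detection": 195_000,
    "shape recognition": 69_000,
}

MS_PER_HOUR = 3_600_000
baseline = throughput["no filtering"]

stats = {}
for name, per_hour in throughput.items():
    ms_per_message = MS_PER_HOUR / per_hour  # average time spent on each message
    slowdown = baseline / per_hour           # how many times slower than no filtering
    stats[name] = (ms_per_message, slowdown)
    print(f"{name:18s} {ms_per_message:6.1f} ms/message  {slowdown:5.1f}x slower")
```

Skin detection costs roughly a sixfold drop in throughput and shape recognition roughly seventeenfold, which is the performance problem in a nutshell.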
Image Filtering Issues
• Diminishing returns on accuracy - most
spam filters won’t see a noticeable
increase in accuracy with the addition
of image filtering.
• That’s likely to change in the future as
spammers discover it’s one of the
better options for circumventing current
solutions.
I Wanna Play!
• Shape recognition:
– UC Berkeley’s blobworld
• Open source
• http://elib.cs.berkeley.edu/
• Skin detection:
– No good open-source examples
– Trivial to write your own using ImageMagick
– http://www.imagemagick.org/
Quick Review
• We covered:
– How and why images appear in spam
– Why the use of images in spam is likely to
increase
– Two methods for filtering images
– Examples of how the two methods work
and don’t work
– Why image filtering isn’t widely used at
this point.