Some Applications of N-Dimensional Graphics to Forensic Data Mining By Monte Hancock Chief Scientist, Celestech, Inc.


Post on 14-Dec-2015


Page 1

Some Applications of N-Dimensional Graphics to Forensic Data Mining

By Monte Hancock, Chief Scientist, Celestech, Inc.

Page 2

Before we begin, we answer the “So What?” question:

Does N-dimensional visualization show me anything I can’t see in three dimensions?

Page 3

www.celestech.com

Let’s display a real 4-feature data set using just 3 of its features

Here I have a 3-D display of a data set consisting of 5,000 points, each having 4 components (dimensions, or “features”). The cluster points are colored according to “class ground truth” contained in the data file.

If I can only display three dimensions at a time, I must choose three to use... and leave one out.

Page 4

What if the feature I decide to leave out is one I should have kept?

To make sure that I don’t “lose” something by accidentally leaving out a “good” feature, I decide to display the data several times, so that every possible triple of features is used in some plot. There are four such 3-D orthoprojections:

● Plot the data in features 1, 2, and 3

● Plot the data in features 1, 2, and 4

● Plot the data in features 1, 3, and 4

● Plot the data in features 2, 3, and 4
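The four plots listed above are simply the ways of choosing 3 features out of 4. As a minimal sketch (the variable names are illustrative, not part of the Celestech tool), the list can be enumerated and counted like this:

```python
from itertools import combinations
from math import comb

features = [1, 2, 3, 4]

# Every way to choose 3 of the 4 features, in ascending order.
triples = list(combinations(features, 3))
for triple in triples:
    print("Plot the data in features", triple)

# The number of 3-D orthoprojections for n features is "n choose 3".
print(comb(4, 3), "plots for 4 features;", comb(30, 3), "plots for 30 features")
```

The count grows quickly with the number of features, so plotting every triple stops being practical well before 30 dimensions.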

Page 5

Here are all four of the possible 3-dimensional representations of this 4-dimensional data set. Which one shows how many natural clusters this data set actually has?

● Plotted using (F1, F2, F3): I see four in this view: red, green, cyan, and magenta.

● Plotted using (F1, F2, F4): I see four in this view: red, green, cyan, and magenta.

● Plotted using (F1, F3, F4): I see four in this view: red, green, cyan, and magenta.

● Plotted using (F2, F3, F4): I see four in this view: red, green, cyan, and magenta.

(The colors are the ground truth assignments that are present in the data file. They reflect measured truth, and are not determined by the graphics algorithm.)

Page 6

4-Dimensional visualization shows the data in its natural, 4D space… and a fifth, brown cluster, having 1,000 data points, that will not be seen as a separate aggregation in any 3D orthoprojection!

No matter which subset of 1, 2, or 3 features you use to plot this raw data, you will NEVER SEE the brown cluster as a separate aggregation. From every 3D perspective, it is in the same place as one of the other four clusters, and is obscured.

In the data’s native space, all FIVE clusters are easily seen.

No special processing, feature selection, or remapping is needed, since Celestech’s N-dimensional visualization technology preserves important aspects of the data’s natural geometry.

The phenomenon illustrated here is not contrived; this kind of occlusion is usually present to some degree in real-world high-dimensional data. This is one of the things that makes pattern processing in these spaces hard.
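The occlusion described on this slide can be reproduced with a tiny synthetic example. The cluster centers below are hypothetical (they are not taken from the data file shown in the slides); they are chosen so that the fifth center differs from each of the other four in exactly one coordinate. Dropping any one coordinate therefore makes the fifth center land on top of one of the others.

```python
from itertools import combinations

# Hypothetical 4-D cluster centers, chosen for illustration only.
centers = {
    "red":     (0, 1, 1, 1),
    "green":   (1, 0, 1, 1),
    "cyan":    (1, 1, 0, 1),
    "magenta": (1, 1, 1, 0),
    "brown":   (1, 1, 1, 1),
}

def project(point, kept):
    """Orthoprojection: keep only the coordinates whose indices are in `kept`."""
    return tuple(point[i] for i in kept)

def hidden_behind(name, kept):
    """Names of other clusters whose projected center coincides with `name`'s."""
    target = project(centers[name], kept)
    return [other for other, c in centers.items()
            if other != name and project(c, kept) == target]

for kept in combinations(range(4), 3):
    distinct = {project(c, kept) for c in centers.values()}
    print("features", kept, "->", len(distinct), "distinct centers;",
          "brown coincides with", hidden_behind("brown", kept))
```

In all four coordinates the five centers are distinct, yet every 3-feature view shows only four distinct positions. Real clusters have spread around their centers, so the overlap in practice is a matter of degree rather than exact coincidence.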

Page 7

Here is the original comma-separated values (CSV) file. Column A is a record identifier (not data), and column G is a ground truth assignment made by the creator of the data.

Important Note: DATA MUST ALL BE NUMERIC!

Page 8

The user is asked four questions:

1.) the name of the csv file containing the data

2.) which (if any) of the columns is an identifier (“index column”)

3.) which (if any) of the columns is a ground truth column

4.) how many columns to plot
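The four answers are enough to split each row of the file into an identifier, a feature vector, and a ground truth label. The sketch below shows one way this could be done; the function name, the 0-based column positions, and the assumption that the file has no header row are all illustrative and do not describe the actual application.

```python
import csv
import io

def load_numeric_csv(text, index_col=None, truth_col=None, n_plot=None):
    """Split each row into (identifier, features, ground truth).

    index_col and truth_col are 0-based column positions, or None if absent.
    n_plot limits how many feature columns are kept (None keeps them all).
    Assumes there is no header row.
    """
    ids, features, truth = [], [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        ids.append(row[index_col] if index_col is not None else None)
        truth.append(row[truth_col] if truth_col is not None else None)
        # float() raises ValueError on non-numeric data, which matches the
        # "DATA MUST ALL BE NUMERIC" requirement.
        values = [float(v) for i, v in enumerate(row)
                  if i not in (index_col, truth_col)]
        features.append(values[:n_plot])
    return ids, features, truth

# Column A (position 0) is the identifier, column G (position 6) the ground truth.
sample = "1,0.5,1.2,3.0,4.1,0.9,2\n2,0.7,1.1,2.8,4.0,1.3,1\n"
ids, features, truth = load_numeric_csv(sample, index_col=0, truth_col=6, n_plot=3)
print(ids, features, truth)
```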

Page 9

User Specifications Complete!

● All user input has been entered and echoed at the left.

● The application responds by parsing the input file, then running a simple Weighted-Nearest-Neighbor classifier to provide a “quick-analysis” to the user below.

● All of this specification information, and the confusion matrix from the classifier, are written to disk in the local directory. The file name is the same as the original file, with “_ds” appended. This is a .txt file.
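The slides do not say how the Weighted-Nearest-Neighbor classifier is built. As a rough sketch of the general idea, the code below lets the k nearest training points vote, each weighted by the inverse of its distance, and tallies the results in a confusion matrix. The choice of k, the inverse-distance weighting, and all names are assumptions made for illustration.

```python
import math
from collections import defaultdict

def weighted_nn_predict(train_x, train_y, query, k=3, eps=1e-9):
    """Vote among the k nearest training points, each weighted by 1/distance."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = defaultdict(float)
    for d, label in dists[:k]:
        votes[label] += 1.0 / (d + eps)  # eps avoids division by zero
    return max(votes, key=votes.get)

def confusion_matrix(truth, predicted):
    """Nested dict of counts: counts[true_label][predicted_label]."""
    counts = defaultdict(lambda: defaultdict(int))
    for t, p in zip(truth, predicted):
        counts[t][p] += 1
    return counts

train_x = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
train_y = [1, 1, 2, 2]
queries = [(0.2, 0.1), (5.1, 5.4)]
truth = [1, 2]

predicted = [weighted_nn_predict(train_x, train_y, q) for q in queries]
cm = confusion_matrix(truth, predicted)
accuracy = sum(cm[label][label] for label in cm) / len(truth)
print(predicted, accuracy)
```

The percentage accuracy reported to the user is the sum of the diagonal of the confusion matrix divided by the number of classified points.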

NB: The percentage accuracy here suggests this is highly structured data! We’ll see…. Literally!

Page 10

● The top figure is the initial display, which shows the 5-dimensional data from the default perspective. Note the clustering!

● Since there are 5,000 rows in the spreadsheet, there are 5,000 points in the display, each colored by the ground truth in its Column G.

● The four-column “control console” is on the left side-bar. Column 1 is a label, column 2 is selected to decrease a value, column 3 is the current value of that control, and column 4 is selected to increase a value. The blank row separates the top section (viewer location) from the bottom section (“speed” and “zoom”).

● The numbers in the top part of the control console refer to each of the five dimensions. According to the numbers, the viewer is now “hovering in 5D space” at location (3.13, 3.13, 3.13, 3.13, 3.13). Each coordinate of the data is ALWAYS z-scored on input, so the origin is ALWAYS at the center of the display.

● In the bottom figure, the user has clicked the right mouse button, which turns off coloring by ground truth. Another click will restore it.

● If you look closely at the bottom figure, you will be able to see the colored coordinate axes buried inside the data!
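The z-scoring mentioned above shifts each coordinate (column) to mean 0 and scales it to unit standard deviation, which is why the centroid of the data sits at the origin. A minimal sketch follows; using the population standard deviation and leaving constant columns unscaled are assumptions, since the slides do not specify either.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Return rows with every column shifted to mean 0 and scaled to unit std."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    # A constant column has std 0; fall back to 1.0 so it maps to all zeros.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in rows]

rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
for row in zscore_columns(rows):
    print(row)
```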

Page 11

● Here the user has increased the “ZOOM” value in the bottom row of the control console. This causes the data to appear to move away. Note that the “+” on the control has turned green; it will stay green to remind the user of the last adjustment made to this control.

● In the bottom figure, the user has clicked the “-” control on the ZOOM row to restore this parameter to its original value, moving the data closer.

● All such parameter changes are executed in real-time on the data display, giving a smooth-motion user experience.

● NB: left-clicking on ANY of the numeric values in column 3 of the control console will restore that value to its initial default. By just holding down the left mouse button and sliding the cursor down column 3, all parameters are reset to their defaults.

Page 12

In the top figure, the user has been “coptering” around the dataset. Notice that the viewer location has been changed by some “increase” and “decrease” commands, indicated by the red minus and green plus signs.

During this coptering, some of the data have swept across the static display header, erasing part of it. A single click of the middle mouse button refreshes the display (bottom figure).

Page 13

The numbers in column one of the control console refer to the dimensions being displayed. The user can reorder these by positioning the mouse on one of these numbers; a “left-click” promotes a feature by swapping it with the one above it, while a “right click” demotes a feature by swapping it with the one below it. In this figure, features 4 and 5 have been swapped. The change is reflected in the data display immediately, and the viewer location is reset to the default.
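The promote and demote actions amount to swapping adjacent entries in the feature ordering. A small sketch of that bookkeeping is shown below; the function names and the choice to do nothing at the top or bottom of the list are assumptions, not documented behavior of the application.

```python
def promote(order, position):
    """Swap the feature at `position` with the one above it (no-op at the top)."""
    order = list(order)
    if position > 0:
        order[position - 1], order[position] = order[position], order[position - 1]
    return order

def demote(order, position):
    """Swap the feature at `position` with the one below it (no-op at the bottom)."""
    order = list(order)
    if position < len(order) - 1:
        order[position + 1], order[position] = order[position], order[position + 1]
    return order

order = [1, 2, 3, 4, 5]
# Left-clicking feature 5 (last row) promotes it, swapping features 4 and 5.
print(promote(order, 4))
```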

Page 14

● In the top figure, the viewer has used the mouse to select some clumps of data for subsequent analysis. This is done in the standard way (click and drag). When all “selections” have been made, the user clicks the right mouse button to confirm.

● Each new selection is given a new color, which corresponds to a number that will be appended to the row for each of the selected points.

● If the user wishes, she can now click the right mouse button; this makes unselected data points white. The user should then click the middle button to erase the boxes. The selected clusters will retain their “new” colors during subsequent coptering in this mode, allowing them to be studied in context.

● Clicking the right mouse button again will restore the original colors of all clusters. But the user may return to this “selected clusters only” mode at any time by clicking the right mouse button.

● Additional selections can be made at any time, in any graphical display mode.

● Currently, there is no way to “undo” a selection.

Page 15

Pressing any keyboard key, or clicking the “red X” will close the application and write out the data files containing file demographics, classification information, and the annotated CSV file showing user selections. This file is a CSV file having the name of the original data file with “_gt” appended.

The format of the output CSV file is the same as the original file, but two columns are appended at the right: the first has the 1-up numbers for the user mouse-selected data, and the second is reserved to hold the assignments of an auto-clustering engine. By default, it contains the number “11”.
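The output format described above can be sketched as follows. The helper names are hypothetical, and two details are assumptions because the slides do not state them: that “_gt” is inserted before the file extension, and that unselected rows receive 0 in the selection column.

```python
from pathlib import Path

DEFAULT_CLUSTER = 11  # value the slide says the reserved column holds by default

def annotated_name(path):
    """'data.csv' -> 'data_gt.csv' (placement before the extension is assumed)."""
    p = Path(path)
    return p.with_name(p.stem + "_gt" + p.suffix)

def annotate_rows(rows, selections):
    """Append the user's selection number and the reserved auto-cluster column.

    `selections` maps row position -> 1-up selection number. Rows that were
    not selected get 0 here, which is an assumption.
    """
    return [list(row) + [selections.get(i, 0), DEFAULT_CLUSTER]
            for i, row in enumerate(rows)]

rows = [[1, 0.5, 2], [2, 0.7, 1]]
print(annotated_name("data.csv"))
print(annotate_rows(rows, {1: 1}))  # the second row belongs to selection 1
```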

Page 16

A Short Case Study: Automated Data Characterization for Anomaly Detection

Page 17

Anomaly Detection

● There are two fundamental approaches to anomaly detection: Closed Corpus, and Open Corpus

● The “best approach” is application dependent.

● To be general, an application must either:

– be easy to repurpose

Or,

– have multiple embedded algorithms, and an adjudicator, that among them cover a range of problem types

Page 18

“Closed Corpus”: there is a known, a priori collection of “anomalous” patterns

This approach characterizes anomalous patterns, and creates detectors for similarity to these patterns.

- Strengths:

● Good track record (e.g., virus and spam detection)

● Supervised learning can be used, because examples of every target pattern can be generated

- Weaknesses:

● The corpus must be regularly updated

● Patterns not in the corpus will not be detected
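A closed-corpus detector scores a sample by how similar it is to the known anomalous patterns. The sketch below is one simple way to express that idea, using distance to the nearest corpus entry; the similarity formula and the example vectors are hypothetical and are not the method used in the application.

```python
import math

def closed_corpus_score(sample, corpus):
    """Similarity to the nearest known anomalous pattern, in the range (0, 1]."""
    nearest = min(math.dist(sample, pattern) for pattern in corpus)
    return 1.0 / (1.0 + nearest)

known_anomalies = [(9.0, 9.0), (0.0, 12.0)]  # hypothetical signature vectors

print(closed_corpus_score((9.0, 9.1), known_anomalies))  # near a known pattern
print(closed_corpus_score((1.0, 1.0), known_anomalies))  # unlike anything in the corpus
```

The second call also illustrates the weakness listed above: a genuinely new kind of anomaly looks nothing like the corpus, so it receives a low score and is not detected.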

Page 19

“Open Corpus”: there is a known, a priori collection of historical patterns constituting “normalcy”

This approach characterizes normal patterns, and creates detectors for deviation from these patterns.

- Strengths:

● Good track record (e.g., change detection, control systems)

● Previously unseen patterns can be evaluated

- Weaknesses:

● More complex and therefore more difficult to build and use

● Unsupervised learning must be used, because it is not known a priori what anomalous patterns must be detected
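An open-corpus detector works the other way around: it models the historical “normal” data and scores a sample by how far it deviates. The sketch below uses a deliberately simple model (per-feature mean and standard deviation); that model, the function names, and the example data are assumptions for illustration only.

```python
from statistics import mean, pstdev

def fit_normalcy(history):
    """Per-feature mean and standard deviation of the historical 'normal' data."""
    columns = list(zip(*history))
    return [mean(c) for c in columns], [pstdev(c) or 1.0 for c in columns]

def open_corpus_score(sample, model):
    """Largest per-feature deviation from normal, in standard deviations."""
    mus, sigmas = model
    return max(abs(v - m) / s for v, m, s in zip(sample, mus, sigmas))

history = [(1.0, 10.0), (1.2, 11.0), (0.8, 9.0), (1.0, 10.0)]  # hypothetical normal data
model = fit_normalcy(history)

print(open_corpus_score((1.0, 10.0), model))  # typical sample
print(open_corpus_score((3.0, 10.0), model))  # first feature far from normal
```

No example of an anomaly was needed to build the model, which is why previously unseen patterns can still be evaluated.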

Page 20

A Technical Approach

● Demanding applications require the strengths of both open and closed corpus solutions.

● By building a hybrid solution having both open and closed components, all the strengths can be realized in a single application.

● An Anomaly Detection application should be an extensible, multi-component application having both open and closed methods.

● The application should use several anomaly scoring components, each measuring a different aspect of data phenomenology

Page 21

Unsupervised Clustering Display from a Scoring Component

Page 22

The Application Produces a Summary Report

● When all the Scoring Components have run, their scores are combined by an adjudicator in the Main Routine. This creates a single anomaly score for each data point.

● The anomaly scores for the data set are statistically normalized (z-scored), then sorted in anomaly score order, and finally plotted for the user as seen here.

● The summary results at the top of the display note that over 2/3 of this data received low or very low scores (as would be expected). Only 0.56% of the data received very high anomaly scores (greater than 3 standard deviations above the mean anomaly score).

● The horizontal bars are in standard deviations: blue is one sigma above the mean, green is two, and so on.
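The combine, normalize, and sort steps described above can be sketched in a few lines. The slides do not say how the adjudicator combines the component scores, so a plain average is used here as a stand-in; the function names and example scores are likewise hypothetical, and the sort is shown highest score first.

```python
from statistics import mean, pstdev

def adjudicate(component_scores):
    """Combine per-component scores into one score per data point (plain average)."""
    return [mean(scores) for scores in zip(*component_scores)]

def zscore(values):
    """Statistically normalize a list of scores to mean 0 and unit std."""
    mu, sigma = mean(values), pstdev(values) or 1.0
    return [(v - mu) / sigma for v in values]

def summary(component_scores):
    """Return (point index, z-scored anomaly score) pairs, highest score first."""
    z = zscore(adjudicate(component_scores))
    return sorted(enumerate(z), key=lambda pair: pair[1], reverse=True)

# Two hypothetical scoring components, each scoring the same five data points.
component_a = [0.1, 0.2, 0.1, 0.9, 0.2]
component_b = [0.0, 0.3, 0.1, 0.8, 0.1]

for index, score in summary([component_a, component_b]):
    print(index, round(score, 2))
```

Because the scores are z-scored, a value of 3 means three standard deviations above the mean anomaly score, which is the scale used by the horizontal bars in the plot.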

Page 23

Interactive N-Dimensional Full-Motion Display of Data Colorized by Anomaly Score

● The application then displays the original data colorized by anomaly score.

● This display supports up to 30 dimensions (5 are depicted here)

● The data set slowly revolves as the user watches so that relationships among normal and unusual data can be checked from various perspectives

● Normal and slightly unusual data are in blue and green, respectively. Unusual and “most unusual” data are in yellow and red.

● This data set consists of 5,000 points; 24 are yellow (about 0.5%), and 4 are red (about 0.1%). Depending upon the requirements of the domain, of course, thresholds between low and high can be positioned at the discretion of the user.
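Colorizing by anomaly score comes down to comparing each z-scored value against a set of user-adjustable thresholds. In the sketch below the threshold values are placeholders only; the slides do not give the actual cut points between blue, green, yellow, and red.

```python
from collections import Counter

def color_for(z, thresholds):
    """Map a z-scored anomaly score to a display color.

    `thresholds` is (slightly_unusual, unusual, most_unusual), in ascending order.
    """
    slightly, unusual, most = thresholds
    if z > most:
        return "red"
    if z > unusual:
        return "yellow"
    if z > slightly:
        return "green"
    return "blue"

example_thresholds = (1.0, 2.0, 3.0)          # placeholder cut points
scores = [-0.4, 0.2, 1.5, 2.6, 3.8, 0.0]      # hypothetical z-scored anomaly scores

colors = [color_for(z, example_thresholds) for z in scores]
print(Counter(colors))
```

Moving the thresholds changes how many points fall into each color band, which is how the display can be tuned to the requirements of a particular domain.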

Page 24

The user can suppress “uninteresting” data to view only high anomaly score data

● Here the user has “toggled OFF” the data having low anomaly scores so that the high-score data can be seen clearly.

● Anomalies are “in the eye of the beholder”: what is anomalous in one mission scenario might be of no interest in another. Therefore, a high anomaly score does not prove a datum is anomalous. Rather, in searching for anomalies, this application provides a quick and principled way to sort a problem set in descending order of anomaly score, so that those data having generally unusual characteristics are collected together at the top of the list and can be checked FIRST.

● Because the processing is completely automated, this sorting paradigm offers a mechanism for forwarding only the most “interesting” data from a disadvantaged collection front-end.

● After checking down to a certain point, the user can be relatively confident that data not reviewed are less likely to be anomalous.