These are general notional tutorial slides on data mining theory and practice from which content may be
freely drawn.
Monte F. Hancock, Jr.
Chief Scientist
Celestech, Inc.
Data Mining is the detection, characterization, and exploitation of
actionable patterns in data.
Data Mining (DM)
• Data Mining (DM) is the principled detection, characterization, and exploitation of actionable patterns in data.
• It is performed by applying modern mathematical techniques to collected data in accordance with the scientific method.
• DM uses a combination of empirical and theoretical principles to Connect Structure to Meaning by:
– Selecting and conditioning relevant data
– Identifying, characterizing, and classifying latent patterns
– Presenting useful representations and interpretations to users
• DM attempts to answer these questions:
– What patterns are in the information?
– What are the characteristics of these patterns?
– Can “meaning” be ascribed to these patterns and/or their changes?
– Can these patterns be presented to users in a way that will facilitate their assessment, understanding, and exploitation?
– Can a machine learn these patterns and their relevant interpretations?
DM for Decision Support
● “Decision Support” is all about…
– enabling users to group information in familiar ways
– controlling complexity by layering results (e.g., drill-down)
– supporting user’s changing priorities
– allowing intuition to be triggered (“I’ve seen this before!”)
– preserving and automating perishable institutional knowledge
– providing objective, repeatable metrics (e.g., confidence factors)
– fusing & simplifying results
– automating alerts on important results (“It’s happening again!”)
– detecting emerging behaviors before they consummate (“Look!”)
– delivering value (timely-relevant-accurate results)
● …helping users make the best choices.
DM Provides “Intelligent” Analytic Functions
● Automating pattern detection – to characterize complex, distributed signatures that are worth human attention… and recognize those that are not.
● Associating events – that “go together” but are difficult for humans to correlate.
● Characterizing interesting processes – not just facts or simple events
● Detecting actionable anomalies – and explaining what makes them “different AND interesting”.
● Describing contexts – from multiple perspectives – with numbers, text and graphics
DM Answers Questions Users are Asking
● Fusion Level 1: Who/What is Where/When in my space?
– Organize and present facts in domain context
● Fusion Level 2: What does it mean?
– Has this been seen before? What will happen next?
● Fusion Level 3: Do I care?
– Enterprise relevance? What action should be taken?
● Fusion Level 4: What can I do better next time?
– Adaptation by pattern updates and retraining
● How certain am I?
– Quantitative assessment of evidentiary pedigree
Useful Data Applications
● Accurate identification and classification – add value to raw data by tagging and annotation (e.g., fraud detection)
● Anomaly / normalcy and fusion – characterize, quantify, and assess “normalcy” of patterns and trends (e.g., network intrusion detection)
● Emerging patterns and evidence evaluation – capturing institutional knowledge of how “events” arise and alerting when they emerge
● Behavior association – detection of actions that are distributed in time & space but “synchronized” by a common objective: “connecting the dots”
● Signature detection and association – detection & characterization of multivariate signals, symbols, and emissions (e.g., voice recognition)
● Concept tagging – reasoning about abstract relationships to tag and annotate media of all types (e.g., automated web bots)
● Software agents assisting analysts – small-footprint “fire-and-forget” apps that facilitate search, collaboration, etc.
Some “Good” Data Mining Analytic Applications
• Help the user focus via unobtrusive automation
– Off-load burdensome labor (perform intelligent searches, smart winnowing)
– Post “smart” triggers/tripwires to data stream (e.g., anomaly detection)
– Help with mission triage (“Sort my in-basket!”)
• Automate aspects of classification and detection
– Determine which sets of data hold the most information for a task
– Support construction of ad hoc “on-the-fly” classifiers
– Provide automated constructs for merging decision engines (multi-level fusion)
– Detect and characterize “domain drift” (the “rules of the game” are changing)
– Provide functionality to make best estimate of “missing data”
• Extract/characterize/employ knowledge
– Rule induction from data, develop “signatures” from data
– Implement reasoning for decision support
– High-dimensional visualization
– Embed “decision explanation” capability into analytic applications
• Capture/automate/institutionalize best practice
– Make proven analytic processes available to all
– Capture rare, perishable human knowledge… and put it everywhere
– Generate “signature-ready” prose reports
– Capture and characterize the analytic process to anticipate user needs
Things that make “hard” problems VERY hard
– Events of interest occur relatively infrequently in very large datasets (“population imbalance”)
– Information is distributed in a complex way across many features (the “feature selection problem”)
– Collection is hard to task, data are difficult to prepare for analysis, and are never “perfect” (“noise” in the data, data gaps, coverage gaps)
– Target patterns are ambiguous/unknown; “squelch” settings are brittle (e.g., hard to balance detection vs. “false-alarm” rates)
– Target patterns change/morph over time and across operational modes (“domain drift”, processing methods become “stale”)
Some Key Principles of “Information Driven” Data Mining
1. Right People, Methods, Tools (in that order)
2. Make no prior assumptions about the problem (“agnostic”)
3. Begin with general techniques that let the data determine the direction of the analysis (“Funnel Method”)
4. Don’t jump to conclusions; perform process audits as needed
5. Don’t be a “one widget wonder”; integrate multiple paradigms so the strengths of one compensate for the weaknesses of another
6. Break the problem into the right pieces (“Divide and Conquer”)
7. Work the data, not the tools, but automate when possible
8. Be systematic, consistent, thorough; don’t lose the forest for the trees.
9. Document the work so that it is reproducible
10. Collaborate to avoid surprises: team members, experts, customer
11. Focus on the Goal: maximum value to the user within cost and schedule
Select Appropriate Machine Reasoners
1.) Classifiers
Classifiers ingest a list of attributes, and determine into which of finitely many categories the entity exhibiting these attributes falls. Automatic object recognition and next-event prediction are examples of this type of reasoning.
2.) Estimators
Estimators ingest a list of attributes, and assign some numeric value to the entity exhibiting these attributes. The estimation of a probability or a "risk score" are examples of this type of reasoning.
3.) Semantic Mappers
Semantic mappers ingest text (structured, unstructured, or both), and generate a data structure that gives the "meaning" of the text. Automatic gisting of documents is an example of this type of reasoning. Semantic mapping generally requires some kind of domain model.
4.) Planners
Planners ingest a scenario description, and formulate an efficient sequence of feasible actions that will move the domain to the specified goal state.
5.) Associators
Associators sample the entire corpus of domain data, and identify relationships among entities. Automatic clustering of data to identify coherent subpopulations is a simple example. A more sophisticated example is the forensic analysis of phone, flight, and financial records to infer the structure of terrorist networks.
Embedded Knowledge…
• Principled, domain-savvy synthesis of “circumstantial” evidence
• Copes well with ambiguous, incomplete, or incorrect input
• Enables justification of results in terms domain experts use
• Facilitates good pedagogical aids
• “Solves the problem like the man does”, and so is comprehensible to most domain experts.
• Degrades linearly in combinatorial domains
• Can grow in power with “experience”
• Preserves perishable expertise
• Allows efficient incremental upgrade/adjustment/repurposing
Features
• A feature is the value assumed by some attribute of an entity in the domain
(e.g., size, quality, age, color, etc.)
• Features can be numbers, symbols, or complex data objects
• Features are usually reduced to some simple form before modeling is performed.
>>>features are usually single numeric values or contiguous strings.<<<
Feature Space
• Once the features have been designated, a feature space can be defined for a domain by placing the features into an ordered array in a systematic way.
• Each instance of an entity having the given features is then represented by a single point in n-dimensional Euclidean space: its feature vector.
• This Euclidean space, or feature space for the domain, has dimension equal to the number of features.
• Feature spaces can be one-dimensional, infinite-dimensional, or anywhere in between.
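As a minimal sketch of the idea, an entity's attribute values can be placed into a fixed order to give its feature vector. The feature names and values below are hypothetical, chosen only for illustration.

```python
# Hypothetical feature order; once chosen, each feature keeps its position.
FEATURE_ORDER = ("size", "age", "weight")

def to_feature_vector(entity):
    """Map an entity (a dict of attribute values) to a point in n-dimensional space."""
    return tuple(float(entity[name]) for name in FEATURE_ORDER)

# Two illustrative entities with made-up values; key order does not matter.
a = to_feature_vector({"size": 3.0, "age": 12, "weight": 40.5})
b = to_feature_vector({"age": 7, "weight": 38.0, "size": 2.5})

# The dimension of the feature space equals the number of features.
dimension = len(FEATURE_ORDER)

# Euclidean distance between the two feature vectors.
distance = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

print(a, b, dimension, round(distance, 3))
```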
How do classifiers work?
Machines
• Data mining paradigms are characterized by
– A “concept of operation” (CONOP: component structure, I/O, training algorithm, operation)
– An architecture (component type, #, arrangement, semantics)
– A set of parameters (weights/coefficients/vigilance parameters)
>>>it is assumed here that parameters are real numbers.<<<
A machine is an instantiation of a data mining paradigm.
• Examples of parameter sets for various paradigms
– Neural Networks: interconnect weights
– Belief Networks: conditional probability tables
– Kernel-based classifiers (SVM, RBF): regression coefficients
– Metric classifiers (K-means): cluster centroids
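To illustrate the distinction between a paradigm and a machine, here is a minimal sketch of a metric (nearest-centroid) classifier. The paradigm stays fixed, and each set of centroids is a different machine. The class labels and centroid values are made up for illustration.

```python
import math

def nearest_centroid(point, centroids):
    """Classify a point by its closest centroid; centroids maps label -> vector."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))

# Two machines of the same paradigm: only the parameters (centroids) differ.
machine_a = {"small": (1.0, 1.0), "large": (5.0, 5.0)}
machine_b = {"small": (2.0, 2.0), "large": (9.0, 9.0)}

point = (4.0, 4.0)
label_a = nearest_centroid(point, machine_a)
label_b = nearest_centroid(point, machine_b)
print(label_a, label_b)
```

The same input receives different labels from the two machines, because the decision depends entirely on the real-valued parameters.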
A Spiral Methodology for the Data Mining Process
The DM Discovery Phase: Descriptive Modeling
• OLAP
• Visualization
• Unsupervised learning
• Link Analysis/Collaborative Filtering
• Rule Induction
The DM Exploitation Phase: Predictive Modeling
• Paradigm selection
• Test design
• Formulation of meta-schemes
• Model construction
• Model evaluation
• Model deployment
• Model maintenance
A “de facto” standard DM Methodology
CRISP-DM (“cross-industry standard process for data mining”)
– 1.) Business Understanding
– 2.) Data Understanding
– 3.) Data Preparation
– 4.) Modeling
– 5.) Evaluation
– 6.) Deployment
Data Mining Paradigms: What does your solution look like?
• Conventional Decision Models
– statistical inference, logistic regression, score cards
• Heuristic Models
– human expert, knowledge-based expert systems, fuzzy logic, decision trees, belief nets
• Regression Models
– neural networks (all sorts), radial basis functions, adaptive logic networks, decision trees, SVM
Real-World DM Business Challenges
• Complex and conflicting goals
– Defining “success”
– Getting “buy in”
• Enterprise data is distributed
• Limited automation
• Unrealistic expectations
Real-World DM Technical Challenges
• big data consume space and time
• efficiency vs. comprehensibility
• combinatorial explosion
• diluted information
• difficult to develop “intuition”
• algorithm roulette
Data Mining Problems: What does your domain look like?
• How well is the problem understood?
• How "big" is the problem?
• What kind of data do we have?
• What question are we answering?
• How deeply buried in the data is the answer?
• How must the answer be presented to the user?
1. Business Understanding
How well is the problem understood?
How well is the problem understood?
• Domain intuition: low/medium/high
– Experts available?
– Good documentation?
– DM team’s prior experience?
– Prior art?
• What is the enterprise definition of “success”?
• What is the target environment?
• How skillful are the users?
• Where are the pitchforks?
2. Data Understanding
3. Preparing the Data
How "big" is the problem?
What kind of data do we have?
DM Aspects of Data Preparation
• Data Selection
• Data Cleansing
• Data Representation
• Feature Extraction and Transformation
• Feature Enhancement
• Data Division
• Configuration Management
How "big" is the problem?
•Number of exemplars (“rows”)
•Number of features (“columns”)
•Number of classes (“ground truth”)
•Cost/schedule/talent (dollars, days, dudes)
•Tools (own/make/buy, familiarity, scope)
What kind of data do we have?
• Feature type: nominal/numeric/complex
• Feature mix: homo/heterogeneous by type
• Feature tempo:
– Fresh/stale
– Periodic/sporadic
– Synchronous/asynchronous
• Feature data quality:
– Low/high SNR
– Few/many gaps
– Easy/hard to access
– Objective/subjective
• Feature information quality
– Salience, correlation, localization, conditioning
– Comprehensive? Representative?
How much data do I need?
• Many heuristics
– Monte’s 6MN rule, other similar
– Support vectors
• Segmentation requirements
• Comprehensive
• Representative
– Consider population imbalance
Feature Saliency Tests
• Correlation/Independence
• Visualization to determine saliency
• Autoclustering to test for homogeneity
• KL-Principal Component Analysis
• Statistical Normalization (e.g., ZSCORE)
• Outliers, Gaps
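As a minimal sketch of the ZSCORE normalization mentioned above, each value is rescaled by the feature's mean and standard deviation. The population standard deviation is assumed here, and the cutoff of 2 used to flag possible outliers is an arbitrary illustrative choice, not a value taken from the slides.

```python
import statistics

def zscore(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("constant feature: z-score is undefined")
    return [(v - mean) / std for v in values]

def flag_outliers(values, cutoff=2.0):
    """Return the values whose absolute z-score reaches the cutoff (assumed threshold)."""
    return [v for v, z in zip(values, zscore(values)) if abs(z) >= cutoff]

# Made-up feature column with mean 5 and population standard deviation 2.
column = [2, 4, 4, 4, 5, 5, 7, 9]
scores = zscore(column)
suspects = flag_outliers(column)
print(scores, suspects)
```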
Making Feature Sets for Data Mining
• Converting Nominal Data to Numeric: Numeric Coding
• Converting Numeric data to Nominal: Symbolic Coding
• Creating Ground-Truth
Information can be Irretrievably Distributed (e.g., the parity-N problem)
0010100110… 1
The best feature set is not necessarily the set of best features.
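The parity problem can be checked by brute force. The sketch below enumerates every bit pattern for a small N (N = 4 is an arbitrary choice) and shows that no single bit, and no set of N-1 bits, says anything about the parity, while all N bits together determine it exactly.

```python
from itertools import product

N = 4  # small illustrative size

def parity(bits):
    """Return 1 if the number of 1-bits is odd, else 0."""
    return sum(bits) % 2

patterns = list(product((0, 1), repeat=N))

# Fraction of patterns in which bit i alone equals the parity (0.5 means no information).
agreement = [
    sum(p[i] == parity(p) for p in patterns) / len(patterns) for i in range(N)
]

def parities_seen(subset):
    """For each assignment of the bits in subset, collect the parities that occur."""
    seen = {}
    for p in patterns:
        key = tuple(p[i] for i in subset)
        seen.setdefault(key, set()).add(parity(p))
    return seen

partial = parities_seen(range(N - 1))  # all bits but the last
full = parities_seen(range(N))         # every bit

print(agreement)
print(all(v == {0, 1} for v in partial.values()))
print(all(len(v) == 1 for v in full.values()))
```

Ranking features one at a time would score every bit as useless here, which is why the best feature set is not necessarily the set of best features.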
An example of a Feature Metric
“Salience”: geometric mean of class precisions
• an objective measure of the ability of a feature to distinguish classes
• takes class proportion into account
• specific to a particular classifier and problem
• does not measure independence
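A minimal sketch of this metric follows. It assumes that “class precision” means, for each class, the fraction of items assigned to that class that truly belong to it, and that a class which is never predicted has precision 0. Both are assumptions; the slides do not spell out the definition. The labels are made up.

```python
import math

def class_precisions(truth, predicted):
    """Per-class precision: correct predictions of a class / all predictions of it."""
    result = {}
    for c in sorted(set(truth) | set(predicted)):
        assigned = [t for t, p in zip(truth, predicted) if p == c]
        # Assumption: a class that is never predicted gets precision 0.
        result[c] = sum(t == c for t in assigned) / len(assigned) if assigned else 0.0
    return result

def salience(truth, predicted):
    """Geometric mean of the class precisions."""
    precisions = list(class_precisions(truth, predicted).values())
    return math.prod(precisions) ** (1.0 / len(precisions))

truth = [0, 0, 0, 0, 1, 1]
from_feature = [0, 0, 0, 1, 1, 1]   # output of a classifier using one feature
always_majority = [0, 0, 0, 0, 0, 0]

print(salience(truth, from_feature))
print(salience(truth, always_majority))
```

Under these assumptions, a classifier that always predicts the majority class scores 0 even though it is right two times out of three, which is one way the measure takes class proportion into account.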
Nominal to Numeric Coding… one step at a time!

Original Data:

| Name | Class | Feature 1 (habitat) | Feature 2 (diet) | Feature 3 (integument) | Feature 4 (morphology) | Feature 5 (life cycle) |
| --- | --- | --- | --- | --- | --- | --- |
| Bill | primates | land | omnivore | skin w/o feathers | biped, no wings | live birth |
| Bubbles | fishes | sea | omnivore | scales | no wings, non-biped | eggs w/o meta |
| Rover | domestic | land | carnivore | skin w/o feathers | no wings, non-biped | live birth |
| Ringo | bugs | land | herbivore | exoskeleton | wings, non-biped | eggs w/ meta |
| Chuck | bacteria | parasitic | other | other | no wings, non-biped | other |
| Tweety | birds | land | omnivore | skin with feathers | wings, biped | eggs w/o meta |

Step 1:

| Name | Class | Feature 1 (habitat) | Feature 2 (diet) | Feature 3 (integument) | Feature 4 (morphology) | Feature 5 (life cycle) |
| --- | --- | --- | --- | --- | --- | --- |
| Bill | mammals | land | omnivore | skin w/o feathers | biped, no wings | live birth |
| Bubbles | non-mammals | sea | omnivore | scales | no wings, non-biped | eggs w/o meta |
| Rover | mammals | land | carnivore | skin w/o feathers | no wings, non-biped | live birth |
| Ringo | non-mammals | land | herbivore | exoskeleton | wings, non-biped | eggs w/ meta |
| Chuck | non-mammals | parasitic | other | other | no wings, non-biped | other |
| Tweety | non-mammals | land | omnivore | skin with feathers | wings, biped | eggs w/o meta |

Step 2:

| Name | Class | Feature 1 (habitat) | Feature 2 (diet) | Feature 3 (integument) | Feature 4 (morphology) | Feature 5 (life cycle) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 2 | 3 | 1 | 3 | 1 |
| 2 | 2 | 1 | 3 | 3 | 4 | 3 |
| 3 | 1 | 2 | 2 | 1 | 4 | 1 |
| 4 | 2 | 2 | 1 | 4 | 1 | 2 |
| 5 | 2 | 3 | 4 | 5 | 4 | 4 |
| 6 | 2 | 2 | 3 | 2 | 2 | 3 |
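The numeric coding step can be sketched as a table lookup. The integer codes below are read off by matching the slide's numeric rows against the mammals/non-mammals rows (for example land = 2, omnivore = 3); the dictionary layout and function name are assumptions made for illustration, and only two of the six records are shown.

```python
FEATURES = ("class", "habitat", "diet", "integument", "morphology", "life cycle")

# Code books inferred from the slide's numeric table.
CODES = {
    "class": {"mammals": 1, "non-mammals": 2},
    "habitat": {"sea": 1, "land": 2, "parasitic": 3},
    "diet": {"herbivore": 1, "carnivore": 2, "omnivore": 3, "other": 4},
    "integument": {
        "skin w/o feathers": 1, "skin with feathers": 2,
        "scales": 3, "exoskeleton": 4, "other": 5,
    },
    "morphology": {
        "wings, non-biped": 1, "wings, biped": 2,
        "biped, no wings": 3, "no wings, non-biped": 4,
    },
    "life cycle": {"live birth": 1, "eggs w/ meta": 2, "eggs w/o meta": 3, "other": 4},
}

def encode(record):
    """Replace each nominal value with its integer code, in FEATURES order."""
    return [CODES[name][record[name]] for name in FEATURES]

bill = {
    "class": "mammals", "habitat": "land", "diet": "omnivore",
    "integument": "skin w/o feathers", "morphology": "biped, no wings",
    "life cycle": "live birth",
}
tweety = {
    "class": "non-mammals", "habitat": "land", "diet": "omnivore",
    "integument": "skin with feathers", "morphology": "wings, biped",
    "life cycle": "eggs w/o meta",
}

print(encode(bill))
print(encode(tweety))
```

Integer codes impose an order and a spacing that the nominal values do not have, so a model that treats them as magnitudes may read structure into the data that is not there.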
Numeric to Nominal Quantization
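One common way to quantize a numeric feature into nominal symbols is to cut its range at fixed bin edges. This is a minimal sketch; the edges and labels (an age feature split at 18 and 65) are hypothetical and are not taken from the slides.

```python
from bisect import bisect_right

def quantize(value, edges, labels):
    """Return the label of the bin that contains value.

    edges must be sorted; bin i covers edges[i-1] <= value < edges[i].
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return labels[bisect_right(edges, value)]

# Hypothetical bins for an age feature.
EDGES = [18, 65]
LABELS = ["minor", "adult", "senior"]

ages = [17, 18, 64.9, 65]
symbols = [quantize(age, EDGES, LABELS) for age in ages]
print(symbols)
```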
“Clusters” Usually Mean Something
How many objects are shown here? One, seen from various perspectives!
This illustrates the danger of using ONE METHOD/TOOL/VISUALIZATION!
Autoclustering
• Automatically find spatial patterns in complex data
– find patterns in data
– measure the complexity of the data
Differential Analysis
• Discover the Difference “Drivers” Between Groups
– Which combination of features accounts for the observed differences between groups?
– Focus research
Sensitivity Analysis
• Measure the Influence of Individual Features on Outcomes
– Rank order features by salience and independence
– Estimate problem difficulty
Rule Induction
• Automatically find semantic patterns in complex data
– discover rules directly from data
– organize “raw” data into actionable knowledge
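One simple way to discover a rule directly from data is to search for the single threshold split that misclassifies the fewest examples. The sketch below does this exhaustively. It is an illustration of the general idea under made-up data, not necessarily the method used in the slides' own example.

```python
from collections import Counter

def best_split(rows, labels):
    """Find the (feature, threshold) split with the fewest misclassified rows."""
    best = None
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for r, lab in zip(rows, labels) if r[f] <= t]
            right = [lab for r, lab in zip(rows, labels) if r[f] > t]
            left_label, left_hits = Counter(left).most_common(1)[0]
            right_label, right_hits = Counter(right).most_common(1)[0]
            errors = len(rows) - left_hits - right_hits
            if best is None or errors < best["errors"]:
                best = {"feature": f, "threshold": t, "left": left_label,
                        "right": right_label, "errors": errors}
    return best

def as_rule(split):
    """Render the split as a readable IF/THEN rule."""
    return (f"IF feature[{split['feature']}] <= {split['threshold']} "
            f"THEN {split['left']} ELSE {split['right']}")

# Made-up data: feature 0 separates the classes, feature 1 is noise.
rows = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (6.0, 2.0), (7.0, 5.0), (8.0, 1.0)]
labels = ["low", "low", "low", "high", "high", "high"]

split = best_split(rows, labels)
print(as_rule(split))
```

Applying the same search recursively to each side of the split would grow a small decision tree, with one rule per leaf.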
A Rule Induction Example
(using data splits)
Rule Induction Example (Data Splits)
4. Modeling
What question are we answering?
How deeply buried in the data is the answer?
How must the answer be presented to the user?
What question are we answering?
• Ground truth type
– Nominal
– Numeric
– Complex (e.g., interval estimate, plan, concept)
• Ground truth data quality
– Low/high SNR
– Few/many gaps
– Easy/hard to access
– Objective/subjective
• Ground truth predictability
– Correlation with features
– Population balance
– Class collisions
How deeply buried in the data is the answer?
• Solvable by a 1-layer Multi-Layer Perceptron (easy)
– Linearly separable; any two classes can be separated by a hyperplane
• Solvable by a 2-layer Multi-Layer Perceptron (moderate)
– Convex hulls of classes overlap, but classes do not
• Solvable by a 3-layer Multi-Layer Perceptron (hard)
– Classes overlap but do not “collide”
• “Intractable”
– Data contain class collisions
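To make the “easy” tier concrete, here is a minimal sketch (not from the slides) that trains a single-layer perceptron on two toy problems. The function name `train_perceptron` and both data sets are illustrative assumptions. AND is linearly separable, so the perceptron learning rule reaches an error-free pass; XOR is not, so no single hyperplane can ever separate it.

```python
def train_perceptron(samples, epochs=100):
    """Train a single-layer perceptron on 2-D inputs.

    Returns (weights, bias, converged), where converged is True only if
    some full pass over the data produced zero misclassifications.
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            predicted = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            delta = target - predicted
            if delta != 0:
                # Classic perceptron update: nudge the hyperplane toward the target.
                w[0] += delta * x1
                w[1] += delta * x2
                b += delta
                errors += 1
        if errors == 0:
            return w, b, True
    return w, b, False


AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_converged = train_perceptron(AND_DATA)[2]  # linearly separable
xor_converged = train_perceptron(XOR_DATA)[2]  # not linearly separable
```

With these toy sets, `and_converged` is true and `xor_converged` is false; XOR needs at least one hidden layer.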
![Page 56: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/56.jpg)
How must the answer be presented to the user?
• Forensics
– GUI, confidence factors, intervals, justification
• Integration
– Web-based, Web-enabled, dll/sl, fully integrated
• Accuracy
– % correct, confusion matrix, lift chart
• Performance
– Throughput, ease of use, accuracy, reliability
![Page 57: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/57.jpg)
![Page 58: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/58.jpg)
Text Book Neural Network
![Page 59: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/59.jpg)
![Page 60: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/60.jpg)
![Page 61: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/61.jpg)
![Page 62: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/62.jpg)
![Page 63: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/63.jpg)
![Page 64: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/64.jpg)
![Page 65: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/65.jpg)
![Page 66: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/66.jpg)
Knowledge Acquisition
What the Expert says:
KE: ...and, primates. What evidence makes you CERTAIN an animal is a primate?
EX: Yeah, well, like...if it’s a land animal that’ll eat anything...but it bears live young and walks upright,...
KE: Any obvious physical characteristics?
EX: Uh...yes...and no feathers, of course, or wings, or any of that... Well, then...then, it’s gotta be a primate...yeah.
KE: So, ANY animal which is a land-dwelling, omnivorous, skin-covered, unwinged featherless biped which bears live young is NECESSARILY a primate?
EX: Yep.
KE: Could such an animal be, say, a fish?
EX: No...it couldn’t be anything but a primate.
![Page 67: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/67.jpg)
What the KE hears:
IF
(f1,f2,f3,f4,f5) = (land, omni, feathers, wingless biped, born alive)
THEN
PRIMATE and (not fish, not domestic, not bug, not germ, not bird)
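As a rough sketch of how such an elicited rule might be coded in a KBES, the snippet below uses the conditions as the KE restates them in the interview. The `Animal` record, its field names, and the function name are hypothetical; the slides do not give an implementation.

```python
from typing import NamedTuple


class Animal(NamedTuple):
    """Hypothetical feature record; field names are illustrative only."""
    habitat: str      # e.g. "land", "water"
    diet: str         # e.g. "omnivore", "herbivore"
    covering: str     # e.g. "skin", "feathers", "scales"
    winged: bool
    biped: bool
    live_birth: bool


def is_primate_per_expert(animal: Animal) -> bool:
    """Fire the expert's rule: every condition must hold."""
    return (
        animal.habitat == "land"
        and animal.diet == "omnivore"
        and animal.covering == "skin"
        and not animal.winged
        and animal.biped
        and animal.live_birth
    )


human = Animal("land", "omnivore", "skin", False, True, True)
chicken = Animal("land", "omnivore", "feathers", True, True, False)
```

The rule is only as good as the expert's statement: it is a sufficient condition as elicited, and it says nothing about primates that fail any one test.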
![Page 68: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/68.jpg)
Evaluation
How must the answer be presented to the user?
![Page 69: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/69.jpg)
Model Evaluation
• Accuracy
– Classification accuracy, geometric accuracy
– Precision/recall
– RMS
– Lift curve
– Confusion matrices
– ROI
• Speed, space, utility, other
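Of the accuracy measures listed, lift is perhaps the least self-explanatory. A minimal sketch, assuming binary labels (1 = responder) and a model score per item: lift at a given fraction is the response rate among the top-scored items divided by the overall response rate. The function name `lift_at` and the toy data are illustrative assumptions.

```python
def lift_at(scores, labels, fraction):
    """Lift in the top `fraction` of items when ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


SCORES = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
LABELS = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# Top 20% holds 2 items, both responders; the base rate is 3 in 10.
top_quintile_lift = lift_at(SCORES, LABELS, 0.2)
```

Plotting `lift_at` over a range of fractions gives the lift curve; a value of 1.0 means the model ranks no better than random selection.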
![Page 70: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/70.jpg)
Classification Errors
• Type I - Accepting an item as a member of a class when it actually is not: a “false positive”.
• Type II - Rejecting an item as a member of a class when it actually is: a “false negative”.
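| Ground Truth | Predicted 1 | Predicted 2 | Predicted 3 | RECALL | Type II Error |
|---|---|---|---|---|---|
| 1 | 302 | 55 | 21 | 79.9% | 20.1% |
| 2 | 128 | 526 | 194 | 62.0% | 38.0% |
| 3 | 35 | 68 | 469 | 82.0% | 18.0% |
| PRECISION | 64.9% | 81.0% | 68.6% | | |
| Type I Error | 35.1% | 19.0% | 31.4% | | |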
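The recall and precision figures in the confusion matrix above can be reproduced with a few lines. This is a sketch: the counts are taken from the slide (rows are ground truth, columns are predictions), while the function names are illustrative.

```python
# Rows: ground-truth class 1..3; columns: predicted class 1..3.
CONFUSION = [
    [302, 55, 21],
    [128, 526, 194],
    [35, 68, 469],
]


def recall_per_class(matrix):
    """Diagonal count divided by the row total (1 - Type II error rate)."""
    return [matrix[i][i] / sum(matrix[i]) for i in range(len(matrix))]


def precision_per_class(matrix):
    """Diagonal count divided by the column total (1 - Type I error rate)."""
    return [
        matrix[j][j] / sum(row[j] for row in matrix)
        for j in range(len(matrix))
    ]


def overall_accuracy(matrix):
    """Sum of the diagonal divided by the total number of items."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


recall = recall_per_class(CONFUSION)
precision = precision_per_class(CONFUSION)
accuracy = overall_accuracy(CONFUSION)
```

Rounded to one decimal place, `recall` and `precision` match the percentages on the slide; the overall accuracy works out to 1297 correct out of 1798 items.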
![Page 71: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/71.jpg)
Model Maintenance
• Retraining, stationarity
• Generalization (e.g. heteroscedasticity)
• Changing the feature set (add/subtract)
• Conventional maintenance issues
![Page 72: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/72.jpg)
What do we give the user besides an application?
• Documentation
• Support
• Model retraining
• New model generation
![Page 73: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/73.jpg)
Using a Paradigm Taxonomy to Select a DM Algorithm
Place paradigms into a taxonomy by specifying their attributes. This taxonomy can be used for algorithm selection.
First, an example taxonomy….
![Page 74: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/74.jpg)
KBES (Knowledge-Based Expert System)
required intuition: high
vector count supported: high
feature count supported: medium
class count supported: medium
cost to develop: high
schedule to develop: high
talent to develop: medium, high
tools to develop: can be expensive to buy/make
feature types supported: nominal/numeric/complex
feature mix supported: homogeneous, heterogeneous
feature data quality needed: need not fill "gaps"
ground truth types supported: nominal, complex
relative representational power: low
relative performance: fast, intuitive, robust
relative weaknesses: ad hoc; relatively simple class boundaries
relative strengths: intuitive; easy to provide conclusion justification
![Page 75: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/75.jpg)
MLP (Multi-Layer Perceptron)
required intuition: low
vector count supported: high
feature count supported: medium
class count supported: medium
cost to develop: low
schedule to develop: medium
talent to develop: medium
tools to develop: easy to obtain inexpensively
feature types supported: numeric
feature mix supported: homogeneous
feature data quality needed: must fill "gaps"
ground truth types supported: nominal, numeric
relative representational power: high
relative performance: moderately fast
relative weaknesses: inscrutable; uncontrolled regression
relative strengths: easy to build
![Page 76: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/76.jpg)
RBF (Radial Basis Function)
required intuition: low
vector count supported: high
feature count supported: medium
class count supported: high
cost to develop: low
schedule to develop: medium
talent to develop: medium
tools to develop: easy to obtain inexpensively
feature types supported: numeric
feature mix supported: homogeneous
feature data quality needed: need not fill "gaps"
ground truth types supported: nominal, numeric
relative representational power: high
relative performance: moderately fast
relative weaknesses: inscrutable; models tend to be large
relative strengths: uncontrolled regression can be mitigated
![Page 77: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/77.jpg)
SVM (Support Vector Machines)
required intuition: low
vector count supported: high
feature count supported: high
class count supported: two
cost to develop: medium
schedule to develop: medium
talent to develop: medium
tools to develop: easy to obtain inexpensively
feature types supported: numeric
feature mix supported: homogeneous
feature data quality needed: must fill "gaps"
ground truth types supported: nominal, numeric
relative representational power: high
relative performance: moderately fast
relative weaknesses: inscrutable; can be hard to train
relative strengths: minimal need to enhance features
![Page 78: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/78.jpg)
Decision Trees (e.g., CART, BBNs)
required intuition: low
vector count supported: high
feature count supported: medium
class count supported: high
cost to develop: low
schedule to develop: medium
talent to develop: medium
tools to develop: easy to obtain inexpensively
feature types supported: nominal, numeric
feature mix supported: homogeneous, heterogeneous
feature data quality needed: need not fill "gaps"
ground truth types supported: nominal, numeric
relative representational power: high
relative performance: moderately fast
relative weaknesses: many "low support" nodes or rules
relative strengths: can provide insight into the domain
![Page 79: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/79.jpg)
The taxonomy can be used to match available paradigms with the
characteristics of the data mining problem to be addressed…
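One way to sketch this matching is to hold part of the taxonomy as data and filter on what the problem needs. The attribute values below are copied from the paradigm slides; the dictionary layout, the key names, and the function `candidate_paradigms` are assumptions made for illustration, and only three of the sixteen attributes are encoded.

```python
TAXONOMY = {
    "KBES": {
        "feature_types": {"nominal", "numeric", "complex"},
        "must_fill_gaps": False,
        "ground_truth": {"nominal", "complex"},
    },
    "MLP": {
        "feature_types": {"numeric"},
        "must_fill_gaps": True,
        "ground_truth": {"nominal", "numeric"},
    },
    "RBF": {
        "feature_types": {"numeric"},
        "must_fill_gaps": False,
        "ground_truth": {"nominal", "numeric"},
    },
    "SVM": {
        "feature_types": {"numeric"},
        "must_fill_gaps": True,
        "ground_truth": {"nominal", "numeric"},
    },
    "Decision Trees": {
        "feature_types": {"nominal", "numeric"},
        "must_fill_gaps": False,
        "ground_truth": {"nominal", "numeric"},
    },
}


def candidate_paradigms(feature_types, has_gaps, ground_truth):
    """Return paradigms whose taxonomy entries are compatible with the problem."""
    matches = []
    for name, attrs in TAXONOMY.items():
        if not set(feature_types) <= attrs["feature_types"]:
            continue  # paradigm cannot take one of the feature types
        if has_gaps and attrs["must_fill_gaps"]:
            continue  # gaps would have to be imputed before this paradigm applies
        if ground_truth not in attrs["ground_truth"]:
            continue  # paradigm cannot produce this kind of answer
        matches.append(name)
    return matches


mixed = candidate_paradigms({"nominal", "numeric"}, has_gaps=True, ground_truth="nominal")
numeric_clean = candidate_paradigms({"numeric"}, has_gaps=False, ground_truth="numeric")
```

Here `mixed` narrows to KBES and Decision Trees, while `numeric_clean` leaves every paradigm except KBES in play. A filter like this only shortlists candidates; cost, schedule, and the need for justification still have to be weighed by hand.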
![Page 80: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/80.jpg)
IF
the ground truth is discrete;
there aren't too many classes;
the class boundaries are simple;
the number of features is medium;
the data are heterogeneous;
there is no comprehensive, representative data set with GT;
the population is unbalanced by class;
the domain is well-understood by available experts;
conclusion justification is needed;
THEN KBES
![Page 81: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/81.jpg)
ELSE IF
the ground truth is numeric;
there is a medium number of classes;
the class boundaries are complex;
the number of features is medium;
the data are numeric;
there is a comprehensive, representative data set tagged with GT;
the population is relatively balanced by class;
the domain is not well-understood by available experts;
conclusion justification is not needed;
THEN MLP
![Page 82: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/82.jpg)
ELSE IF
the ground truth is numeric or nominal;
there is a large number of classes;
the class boundaries are very complex;
the number of features is medium;
the data are numeric;
there is a representative data set tagged with GT;
the population is unbalanced by class;
the domain is not well-understood by available experts;
conclusion justification is not needed;
THEN RBF
![Page 83: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/83.jpg)
ELSE IF
the ground truth is numeric or nominal;
the number of classes is two;
the class boundaries are very complex;
the number of features is very large;
the data are numeric;
there is a comprehensive, representative data set tagged with GT;
the population is unbalanced by class;
the domain is not well-understood by available experts;
conclusion justification is not needed;
THEN SVM
![Page 84: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/84.jpg)
ELSE IF
the ground truth is numeric or nominal;
there is a medium number of classes;
the class boundaries are very complex;
the number of features is medium;
the data are numeric, nominal, or complex;
there is a representative data set tagged with GT;
the population is unbalanced by class;
the domain is not well-understood by available experts;
conclusion justification is needed;
THEN Decision Tree (CART, BBN, etc.)
END IF
![Page 85: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/85.jpg)
Common Reasons Data Mining Projects Fail
![Page 86: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/86.jpg)
Mistakes can occur in each major element of data mining practice!
1. Specification of Enterprise Objectives
– Defining “success”
2. Creation of the DM Environment
– Understanding and Preparing the Data
3. Data Mining Management
4a,b. Descriptive Modeling and Predictive Modeling
– Detecting and Characterizing Patterns
– Building Models
5. Model Evaluation
6. Model Deployment
7. Model Maintenance
![Page 87: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/87.jpg)
1. Specification of Enterprise Objectives
Define “success”:
• Knowledge acquisition interviews (who, what, how)
• Objective measures of performance (enterprise specific)
• Assessment of enterprise process and data environment
• Specification of data mining objectives
![Page 88: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/88.jpg)
Specification Mistakes
• DM projects require careful management of user expectations. Choosing the wrong person as the customer interface can guarantee user disappointment.
(GIGOO: Garbage in, GOLD out!)
• Since the default assessment of “R&D type” efforts is “failure”, not defining “success” unambiguously will guarantee “failure”.
![Page 89: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/89.jpg)
2. Creation of the DM Environment
• Data Warehouse/Data Mart/Database
• Metadata and schemas
• Data dependencies
• Access paths and mechanisms
![Page 90: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/90.jpg)
Environmental Mistakes
• Big data require bigger storage. DM efforts typically work against multiple copies of the data; plan for 2 to 3 times the raw data volume.
• Unwillingness to invest in tools forces data miners to consume resources building inferior versions of what could have been purchased more cheaply.
• Get labs and network connections set up quickly.
![Page 91: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/91.jpg)
Understanding the Data
• Enterprise data survey
– Data as a process artifact
– Temporal considerations
• Data Characterization
– Metadata
– Collection paths
• Data Metrics and Quality
– Currency, completeness, correctness, correlation
![Page 92: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/92.jpg)
A List of Common Data Problems
• Conformation (e.g., a dozen ways to say lat/lon)
• Accessibility (distributed, sensitive)
• Ground Truth (missing, incorrect)
• Outliers (detect/process)
• Gaps (imputation scheme)
• Time (coverage, periodicity, trends, Nyquist)
• Consistency (intra/inter record)
• Class collisions (how to adjudicate)
• Class population imbalance (balancing)
• Coding/quantization
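Several of these problems can be screened for mechanically. As one example, the sketch below finds class collisions, taking a collision to mean identical feature vectors tagged with different ground truth (an assumed reading of the term). The function name and the toy records are illustrative.

```python
from collections import defaultdict


def find_class_collisions(records):
    """Map each colliding feature vector to the set of labels it carries.

    `records` is an iterable of (feature_tuple, label) pairs.
    """
    labels_by_features = defaultdict(set)
    for features, label in records:
        labels_by_features[features].add(label)
    return {
        features: labels
        for features, labels in labels_by_features.items()
        if len(labels) > 1
    }


RECORDS = [
    ((1, "red"), "A"),
    ((1, "red"), "B"),   # same features as the record above, different class
    ((2, "blue"), "A"),
    ((2, "blue"), "A"),  # duplicate, but not a collision
]

collisions = find_class_collisions(RECORDS)
```

How to adjudicate the collisions found (drop them, relabel by majority, or add a distinguishing feature) remains a judgment call for the data miner.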
![Page 93: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/93.jpg)
Data Understanding Mistakes
• Assuming that no understanding of the domain is needed for a successful DM effort
• Temporal infeasibility: assuming every type of data you find in the warehouse will actually be there when your fielded system needs it.
• Ignoring the data conformation problem
![Page 94: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/94.jpg)
Data Preparation Mistakes
• Improper handling of missing data, outliers
• Improper conditioning of data
• “Trojan Horsing” ground truth into the feature set
• Having no plan for getting operational access to data
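A crude screen for ground truth hiding in the feature set, taking “Trojan Horsing” to mean that a feature encodes the answer, is to flag any single feature whose value alone determines the label. This heuristic is an illustrative assumption, not a method given in the slides, and a flagged feature is a suspect to review rather than proof of a leak.

```python
def suspicious_features(rows, labels):
    """Return column indices whose values each map to exactly one label.

    A column is flagged only if it has no more distinct values than there
    are classes, which skips ID-like columns with one value per row.
    """
    n_classes = len(set(labels))
    flagged = []
    for col in range(len(rows[0])):
        label_for_value = {}
        consistent = True
        for row, label in zip(rows, labels):
            # setdefault stores the first label seen for this value.
            if label_for_value.setdefault(row[col], label) != label:
                consistent = False
                break
        if consistent and len(label_for_value) <= n_classes:
            flagged.append(col)
    return flagged


ROWS = [
    (5.1, "closed"),
    (4.9, "open"),
    (6.0, "closed"),
    (5.1, "open"),
]
LABELS = ["bad", "good", "bad", "good"]

leaks = suspicious_features(ROWS, LABELS)  # column 1 mirrors the label
```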
![Page 95: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/95.jpg)
3. Data Mining Management
• Data mining skill mix (who are the DM practitioners?)
• Data mining project planning (RAD vs. waterfall)
• Data mining project management
• Sample DM project cost/schedule
• Don’t forget Configuration Management!
![Page 96: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/96.jpg)
DM Management Mistakes
• Appointing a “domain expert” as the technical lead on a DM project virtually guarantees that no new ground will be covered.
• Inadequate schedule and/or budget poison the psychological atmosphere necessary for discovery.
• Failure to parallelize work
• Allowing planless tinkering
• Letting technical people “snow” you
• Failure to conduct “process audits”
![Page 97: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/97.jpg)
Configuration Management
• Nomenclature and naming conventions
• Documenting the workflow for reproducibility
• Modeling Process Automation
![Page 98: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/98.jpg)
Configuration Management Mistakes
• Not having a configuration management plan (files, directories, nomenclature, audit trail) virtually guarantees that any success you have will be unreproducible.
• Allowing each data miner to establish their own documentation and auditing procedures guarantees that no one will understand what anyone else has done.
• Failure to automate configuration management (e.g., putting annotated experiment scripts in a log) guarantees that your configuration management plan will not work.
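Automating the audit trail can be as simple as appending one annotated record per experiment to a log file. The sketch below is one possible shape for that, using only the Python standard library; the function name `log_experiment`, the record fields, and the JSON-lines format are all assumptions for illustration.

```python
import hashlib
import json
import time


def log_experiment(log_path, script_text, params, metrics, note=""):
    """Append one experiment record to a JSON-lines log and return it."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        # Hash of the exact script text, so a run can be tied to its code.
        "script_sha256": hashlib.sha256(script_text.encode("utf-8")).hexdigest(),
        "params": params,
        "metrics": metrics,
        "note": note,
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```

A call such as `log_experiment("experiments.jsonl", script_text, {"hidden_units": 8}, {"accuracy": 0.72}, note="baseline")` (file name and values hypothetical) leaves a record that any team member can read back line by line.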
![Page 99: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/99.jpg)
4a. Descriptive Modeling
• OLAP (on-line analytical processing)
• Visualization
• Unsupervised learning
• Link/Market Basket Analysis
• Collaborative Filtering
• Rule Induction Techniques
• Logistic Regression
![Page 100: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/100.jpg)
4b. Predictive Modeling
• Paradigms
• Test Design
• Meta-Schemes
• Model Construction
• Model Evaluation
• Model Deployment
• Model Maintenance
![Page 101: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/101.jpg)
Paradigms
• Know what they are
• Know when to use which
• Know how to instantiate them
• Know how to validate them
• Know how to maintain them
![Page 102: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/102.jpg)
Model Construction
• Architecture (monolithic, hybrid)
• Formulation of Objective Function
• Training (e.g., NN)
• Construction (e.g., KBES)
• Meta Schemes
– Bagging
– Boosting
– Post-process model calibration
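As an illustration of a meta scheme, here is a minimal bagging sketch: each base model is fit to a bootstrap sample (drawn with replacement) and the ensemble predicts by majority vote. The base learner, a one-feature threshold “stump”, and all names and data are assumptions chosen to keep the example short.

```python
import random


def fit_stump(points):
    """Choose the threshold with the fewest training errors.

    `points` is a list of (x, label) pairs; a stump predicts 1 when x > threshold.
    """
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(y) for x, y in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def fit_bagged_stumps(points, n_models=25, seed=0):
    """Fit one stump per bootstrap sample of the training data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]  # sample with replacement
        models.append(fit_stump(sample))
    return models


def predict(models, x):
    """Majority vote over the stumps."""
    votes = sum(x > threshold for threshold in models)
    return 1 if 2 * votes > len(models) else 0


DATA = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
        (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
MODELS = fit_bagged_stumps(DATA)
```

Boosting differs in that later models are fit with extra weight on the items earlier models got wrong, rather than on independent resamples.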
![Page 103: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/103.jpg)
Modeling Mistakes
• The “Silver Bullet Syndrome”: relying entirely on a single tool/method
• Expecting your tools to think for you
• Overreliance on visualization
• Using tools that you don’t understand
• Not knowing when to quit (maybe this is just dirt)
• Quitting too soon (I haven’t dug deep enough)
• Picking the wrong modeling paradigm
• Ignoring population imbalance
• Overtraining
• Ignoring feature correlation
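Why ignoring population imbalance is a mistake can be shown with a few lines of arithmetic. In this sketch (toy numbers, illustrative function names), a “model” that always predicts the majority class scores 95% accuracy while finding none of the minority class.

```python
def accuracy(truth, predicted):
    """Fraction of items whose prediction matches ground truth."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def recall(truth, predicted, positive):
    """Fraction of true `positive` items that were predicted as `positive`."""
    relevant = [p for t, p in zip(truth, predicted) if t == positive]
    return sum(p == positive for p in relevant) / len(relevant)


TRUTH = [0] * 95 + [1] * 5      # 95% majority class, 5% minority class
ALWAYS_MAJORITY = [0] * 100     # degenerate model: always predict class 0

majority_accuracy = accuracy(TRUTH, ALWAYS_MAJORITY)
minority_recall = recall(TRUTH, ALWAYS_MAJORITY, positive=1)
```

This is why per-class recall and precision, not overall accuracy alone, should be reported for unbalanced populations.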
![Page 104: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/104.jpg)
5. Model Evaluation
• Blind Testing
• N-fold Cross-Validation
• Generalization and Overtraining
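N-fold cross-validation can be sketched as an index generator: the data are cut into N folds, and each fold serves once as the test set while the rest are used for training. The function name is illustrative, and the sketch assumes the items have already been shuffled.

```python
def n_fold_indices(n_items, n_folds):
    """Yield (train_indices, test_indices) once per fold.

    Fold sizes differ by at most one item.
    """
    indices = list(range(n_items))
    fold_sizes = [
        n_items // n_folds + (1 if i < n_items % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


FOLDS = list(n_fold_indices(10, 3))  # fold sizes 4, 3, 3
```

Averaging the test score over all folds gives a less optimistic estimate of generalization than a score computed on the training data.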
![Page 105: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/105.jpg)
Model Evaluation Mistakes
• Not validating the model
• Validating the model on the training data
• Not escrowing a “holdback set”
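Escrowing a holdback set means setting data aside before any modeling starts and touching it only for the final check. A minimal sketch, with the fractions, seed, and function name chosen arbitrarily for illustration:

```python
import random


def split_with_holdback(items, holdback_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle once, then cut into (train, test, holdback) partitions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_holdback = int(len(shuffled) * holdback_fraction)
    n_test = int(len(shuffled) * test_fraction)
    holdback = shuffled[:n_holdback]
    test = shuffled[n_holdback:n_holdback + n_test]
    train = shuffled[n_holdback + n_test:]
    return train, test, holdback


train, test, holdback = split_with_holdback(range(100))
```

The test partition can be used repeatedly during model development; the holdback partition should be scored once, at the end.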
![Page 106: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/106.jpg)
6. Model Deployment
• ASP (application service provider)
• API (application program interface)
• Other
– plug-ins
– linked objects
– file interface, etc.
![Page 107: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/107.jpg)
Model Deployment Mistakes
• Not considering the fielded architecture
• No user training
• Not having any operational performance requirements (except “accuracy”)
![Page 108: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/108.jpg)
7. Model Maintenance
• Retraining
• Poor generalization
– Heteroscedasticity
– Non-stationarity
– Overtraining
• Changing the problem architecture
– Adding/subtracting features
– Modifying ground truth
• Other
![Page 109: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/109.jpg)
Model Maintenance Mistakes
• Not having a mechanism, method, and criteria for tracking performance of the fielded model
• Not providing a model “retraining” capability
• No documentation, no support
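A tracking mechanism for a fielded model does not have to be elaborate. The sketch below keeps a rolling window of recent outcomes and raises a retraining flag when windowed accuracy falls below a floor. The class name, window size, and threshold are illustrative assumptions; real criteria would come from the enterprise's own definition of success.

```python
from collections import deque


class PerformanceMonitor:
    """Track rolling accuracy of a fielded model against later-known truth."""

    def __init__(self, window=100, min_accuracy=0.8):
        self.outcomes = deque(maxlen=window)  # oldest outcomes drop off automatically
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        """Store whether one prediction turned out to be correct."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """Flag only once the window is full, to avoid noisy early alarms."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.min_accuracy
```

A steady decline in the windowed accuracy is one symptom of non-stationarity in the fielded data.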
![Page 110: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/110.jpg)
Published by: Digital Press, 2001
ISBN: 1-55558-231-1
![Page 111: These are general notional tutorial slides on data mining theory and practice from which content may be freely drawn. Monte F. Hancock, Jr. Chief Scientist](https://reader031.vdocuments.mx/reader031/viewer/2022013011/56649d945503460f94a7ba17/html5/thumbnails/111.jpg)