Posted on 15-Jan-2016
CSC 599: Computational Scientific Discovery
Lecture 4: Machine Learning and Model Search
Outline
- Computational Reasoning in Science, cont'd: Computer Algebra; Bayesian nets
- Brief introduction to Artificial Intelligence: search spaces and search operators; Newell's model of intelligence
- Brief introduction to Machine Learning: error, precision and accuracy; overfitting
- Computational scientific discovery vs. Machine Learning: importance of sticking to the paradigm; CSD vs. ML: the take-home message
Computer Algebra
Q: Have an ungodly amount of algebra to do (physics, engineering)?
A: Forget numbers! Try a Computer Algebra System (CAS)!
- For algebraic symbol manipulation
- Examples: Mathematica, Maple
(Compare: numerical methods and statistics packages, which do "number crunching". Examples: Matlab, Mathematica; SAS, SPSS)
Bayesian Networks
Idea:
- Complexity: lots of variables; non-deterministic environment
- Simplicity: patterns of influence between variables
- A Bayesian net encodes the influence patterns
Example variables:
a) Prof assigns homework? (true or false)
b) TA assigns homework? (true or false)
c) Will your weekend be busy? (true or false)
Bayesian Networks (2)
Example: pr = prof assigns, ta = TA assigns, b = busy

p(pr) = 0.6         p(-pr) = 0.4
p(ta|pr) = 0.1      p(-ta|pr) = 0.9
p(ta|-pr) = 0.9     p(-ta|-pr) = 0.1
p(b|ta,pr) = 0.99   p(-b|ta,pr) = 0.01
p(b|-ta,pr) = 0.8   p(-b|-ta,pr) = 0.2
p(b|ta,-pr) = 0.9   p(-b|ta,-pr) = 0.1
p(b|-ta,-pr) = 0.1  p(-b|-ta,-pr) = 0.9
Bayesian Networks (3)
P(pr=T|b=T) = P(b=T, pr=T) / P(b=T)
            = sum over ta of P(b=T, ta, pr=T) / sum over ta, pr of P(b=T, ta, pr)
            = [(0.99 * 0.1 * 0.6 = 0.0594, TTT) + (0.8 * 0.9 * 0.6 = 0.432, TFT)]
              / [0.0594 (TTT) + 0.432 (TFT) + 0.324 (TTF) + 0.004 (TFF)]
            = 0.4914 / 0.8194
            = 0.599707103
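The arithmetic above can be checked by brute-force enumeration over the joint distribution, which factorizes as P(pr) * P(ta|pr) * P(b|ta,pr). A minimal Python sketch (the variable names and table layout are mine; the probabilities come from the slide):

```python
# Brute-force inference in the prof/TA/busy network.
p_pr = {True: 0.6, False: 0.4}
p_ta_given_pr = {True: {True: 0.1, False: 0.9},    # p_ta_given_pr[pr][ta]
                 False: {True: 0.9, False: 0.1}}
p_b_given = {(True, True): 0.99, (False, True): 0.8,   # (ta, pr) -> p(b=T)
             (True, False): 0.9, (False, False): 0.1}

def joint(pr, ta, b):
    """P(pr, ta, b) via the chain rule the network encodes."""
    p = p_pr[pr] * p_ta_given_pr[pr][ta]
    pb = p_b_given[(ta, pr)]
    return p * (pb if b else 1 - pb)

# P(pr=T | b=T) = sum_ta P(b=T, ta, pr=T) / sum_{ta,pr} P(b=T, ta, pr)
num = sum(joint(True, ta, True) for ta in (True, False))
den = sum(joint(pr, ta, True) for pr in (True, False) for ta in (True, False))
print(num / den)  # ~0.5997, matching the slide
```

Enumeration is exponential in the number of variables, which is exactly why the d-separation shortcuts on the next slide matter.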
Bayesian Networks (4)
Q: That's a lot of work! Can't we get the network to simplify things?
A: Yes, d-separation! Two sets of nodes X, Y are d-separated given Z if every path between them is blocked by some middle node M:
1. M is in Z and sits in a chain: i --> M --> j. Intuition: if I know M, knowing i doesn't tell me any more about j.
2. M is in Z and sits in a fork: i <-- M --> j. Intuition: if I know the common cause M, knowing the 1st result i doesn't tell me any more about the 2nd result j.
3. M is NOT in Z (and neither is any of its descendants) and sits in a collider: i --> M <-- j. Intuition: if I did know i and the common result M, then i would explain M away and justify not believing in j; with M unobserved, i and j tell us nothing about each other.
An A.I. researcher's worldview
Problems are divided into:
1. Those solvable by "algorithms". Algorithm = do these steps and you are guaranteed to get the answer in a "reasonable" time. Classic examples: searching and sorting.
2. Those that aren't. No way to guarantee you will get an answer (in polynomial time). Q: What do you do? A: Search for one!
A.I. Worldview (2)
Example of an "A.I." problem: chess.
- Can you guarantee that you will always win at chess?
- Can you guarantee that you will (at least) never lose?
- No? Well, that makes it interesting!
Compare with Tic-Tac-Toe: you can guarantee that you will never lose. (That's why only children play it.)
A.I. Worldview (3)
A.I. paradigm for searching for a solution. Remember: there is no "algorithm" for obtaining the answer; we need to search for one.
States: configurations of the world.
Operators: define legal transitions from one state to another. Example: white knight g1->f3; white pawn c2->c4.
A.I. Worldview (4)
State space (or search space): the space of states reachable by operators from the initial state.
A.I. Worldview (5)
Goal state: one or more states that have the configuration that you want. In chess: checkmate!
A.I. Worldview (6)
A.I. pioneer Allen Newell's view of intelligence: a given level of "intelligence" is achievable with
a) Lots of knowledge and little search (chess grandmaster)
b) Little knowledge and lots of search ("stupid" program)
c) Some knowledge and some search ("smart" program)
A.I. Worldview (7)
Idea:
1. Start at the initial state
2. Apply operators to traverse the search space
3. Hope to arrive at a goal state
Issues:
- How quickly can you find the answer? (time!)
- How much memory do you need? (space!)
- How good is your goal state? Optimal = shortest path? Optimal = least total arc cost?
A.I. Worldview (8)
Tools:
Uninformed search:
- Depth 1st
- Breadth 1st
- Uniform cost (best 1st where best = least cost so far)
- Iterative deepening depth 1st
Informed search (a heuristic function tells the "desirability" of each node):
- Greedy (best 1st where best = least estimated cost to goal)
- A* (best 1st where best = uniform + greedy)
Search from:
- Initial state to goal state(s)
- Goal state to initial state(s)
- Both directions
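As an illustration of uninformed search, breadth-first search fits in a few lines. The toy state graph below is my own stand-in for "states and operators"; it is not from the lecture:

```python
from collections import deque

def bfs(start, goal, successors):
    """Breadth-first search: returns a path with the fewest operator
    applications from start to goal, or None if the goal is unreachable."""
    frontier = deque([[start]])   # queue of partial paths
    visited = {start}
    while frontier:
        path = frontier.popleft()
        state = path[-1]
        if state == goal:
            return path
        for nxt in successors(state):      # apply every legal operator
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None

# Toy state space: states are letters, operators are the directed edges.
edges = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D', 'E'], 'D': ['F'], 'E': ['F']}
print(bfs('A', 'F', lambda s: edges.get(s, [])))  # -> ['A', 'B', 'D', 'F']
```

Swapping the FIFO queue for a stack gives depth-first search; ordering the frontier by accumulated cost gives uniform-cost search.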
Machine Learning and A.I.
ML goals: find some data structure that permits better performance on some set of problems:
- Prediction
- Conciseness
- Some combination thereof
What about coefficient-finding numerical methods? They're "algorithms" (in the A.I. sense)!
1. Stuff in the data
2. Turn the crank
3. O(n^3) later, out comes the answer
ML example Decision Tree learning
Task: build a decision tree that predicts a class.
- Leaves = guessed class
- Non-leaf nodes = tests on attribute variables
- Each edge to a child represents one or more attribute values
ML example Decision Tree learning (2)
Approach: greedy search.
1. Use information theory to find the best attribute to split the data on
2. Split the data on that attribute
3. Recursively continue until either:
   a) No more attributes to split on (label with the majority class)
   b) All instances are in the same class (label with that class)
ML example Decision Tree learning (3)
A bit of information theory:
- C_i = some class value to guess
- S = some set of examples
- freq(C_i, S) = how many C_i's are in S
- size(S) = size of S
Intuition: with k choices C_1 .. C_k, how much information is needed to specify one C_i from S?
- Not many C_i's (freq ≈ 0)? On average, few bits: each occurrence costs more than 1 bit, but there are not many occurrences.
- Lots of C_i's (freq ≈ size(S))? Not many bits: each occurrence costs less than 1 bit (good default guess).
- Some C_i's (freq ≈ size(S)/2)? About 1 bit: about 1 bit each, occurring about half the time.
ML example Decision Tree learning (4)
Probability of choosing one class value from the set: freq(C_i, S) / size(S)
Information to specify one C_i in S: -lg( freq(C_i, S) / size(S) ) bits
For the expected information, multiply by the class proportions:
info(S) = - sum(i=1 to k): freq(C_i, S)/size(S) * lg( freq(C_i, S)/size(S) )
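The info(S) formula translates directly into code. A small sketch (the function name and counts-based interface are my choices, not from the lecture):

```python
from math import log2

def info(class_counts):
    """info(S): expected bits to specify one class value, given the
    frequency of each class in S. The 0 * lg(0) terms are taken as 0,
    so zero-count classes are simply skipped."""
    total = sum(class_counts)
    return sum(-(c / total) * log2(c / total)
               for c in class_counts if c > 0)

print(info([10, 0]))           # all one class       -> 0.0 bits
print(info([5, 5]))            # perfect 50-50 split -> 1.0 bits
print(round(info([9, 5]), 3))  # 9 yes / 5 no        -> 0.94 bits
```

The first two calls reproduce the two worked cases on the following slides; the third is the class split of the tennis data used later.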
ML example Decision Tree learning (5)
Let's get an intuition.
Case 1: every member of S is a C_1, none is a C_2.
size(S) = 10, freq(C_1, S) = 10, freq(C_2, S) = 0. Therefore:
info(S) = - sum(i=1,2): freq(C_i, S)/size(S) * lg( freq(C_i, S)/size(S) )
        = - [(10/10) * lg(10/10)] - [(0/10) * lg(0/10)]
        = -0 - 0 = 0   (taking 0 * lg(0) = 0)
Intuition: "If we know that we're dealing with S, then we know that all of its members are in C_1. No need to specify which is C_1 and which is C_2."
ML example Decision Tree learning (6)
Let's get an intuition (cont'd).
Case 2: half the members of S are C_1, half are C_2.
size(S) = 10, freq(C_1, S) = 5, freq(C_2, S) = 5. Therefore:
info(S) = - sum(i=1,2): freq(C_i, S)/size(S) * lg( freq(C_i, S)/size(S) )
        = - [(5/10) * lg(5/10)] - [(5/10) * lg(5/10)]
        = -2 * (0.5 * -1) = 1
Intuition: "If we know that we're dealing with S, then it's a 50-50 guess which members belong to C_1 and which to C_2. We need to specify which (no compression possible)."
ML example Decision Tree learning (7)
Recall the plan: select the "best" attribute to partition on, where "best" = best separates the classes.
Information gain for some attribute:
gain(attr) = (avg. info needed to specify a class) - (avg. info needed to specify a class after partitioning by attr)
           = info(T) - info_attr(T)
When info_attr(T) is small, the classes are well separated (big gain!), where:
- n = number of attribute values
- T_i = the subset whose members all have the same attribute value v_i
- info_attr(T) = sum(i=1 to n): size(T_i)/size(T) * info(T_i)
ML example Decision Tree learning (8)
Example data (should we play tennis?)

Outlook   Temp  Humidity  Windy  PlayTennis?
sunny      75      70     true      yes
sunny      80      90     true      no
sunny      85      85     false     no
sunny      72      95     false     no
sunny      69      70     false     yes
overcast   72      90     true      yes
overcast   83      78     false     yes
overcast   64      65     true      yes
overcast   81      75     false     yes
rain       71      80     true      no
rain       65      70     true      no
rain       75      80     false     yes
rain       68      80     false     yes
rain       70      96     false     yes
ML example Decision Tree learning (9)
info(PlayTennis) = -9/14 * lg(9/14) - 5/14 * lg(5/14) = 0.940 bits
info_outlook(PlayTennis) = 5/14 * (-2/5 * lg(2/5) - 3/5 * lg(3/5))
                         + 4/14 * (-4/4 * lg(4/4) - 0/4 * lg(0/4))
                         + 5/14 * (-3/5 * lg(3/5) - 2/5 * lg(2/5))
                         = 0.694 bits
gain(outlook) = 0.940 - 0.694 = 0.246 bits
ML example Decision Tree learning (10)
info(PlayTennis) = -9/14 * lg(9/14) - 5/14 * lg(5/14) = 0.940 bits
info_windy(PlayTennis) = 6/14 * (-3/6 * lg(3/6) - 3/6 * lg(3/6))
                       + 8/14 * (-6/8 * lg(6/8) - 2/8 * lg(2/8))
                       = 0.892 bits
gain(windy) = 0.940 - 0.892 = 0.048 bits
gain(outlook) > gain(windy), so test on outlook!
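Both gains can be reproduced by coding info and info_attr over the 14 tennis examples. A sketch, with the dataset hard-coded to match the counts the slide uses (outlook: 5 sunny with 2 yes, 4 overcast all yes, 5 rain with 3 yes; windy: 6 true with 3 yes, 8 false with 6 yes). Values below are unrounded, so they differ slightly from the slide's two-decimal arithmetic:

```python
from math import log2

# (outlook, windy, play?) for the 14 tennis examples
data = [('sunny', True, 'yes'), ('sunny', True, 'no'), ('sunny', False, 'no'),
        ('sunny', False, 'no'), ('sunny', False, 'yes'),
        ('overcast', True, 'yes'), ('overcast', False, 'yes'),
        ('overcast', True, 'yes'), ('overcast', False, 'yes'),
        ('rain', True, 'no'), ('rain', True, 'no'), ('rain', False, 'yes'),
        ('rain', False, 'yes'), ('rain', False, 'yes')]

def info(examples):
    """Expected bits to specify the class of one example."""
    classes = [c for *_, c in examples]
    ps = [classes.count(v) / len(classes) for v in set(classes)]
    return sum(-p * log2(p) for p in ps if p > 0)

def gain(examples, attr):
    """info(T) - info_attr(T) for attribute index attr (0=outlook, 1=windy)."""
    values = {e[attr] for e in examples}
    info_attr = sum(len(sub) / len(examples) * info(sub)
                    for v in values
                    for sub in [[e for e in examples if e[attr] == v]])
    return info(examples) - info_attr

print(round(gain(data, 0), 3))  # gain(outlook) -> 0.247 (slide: 0.940 - 0.694 = 0.246)
print(round(gain(data, 1), 3))  # gain(windy)   -> 0.048
```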
ML example Decision Tree learning (11)
Guarding against overfitting: cross-validation.
We want to use all the data, but using test data to train is cheating. So split the data into k sets:
for (i = 0; i < k; i++) {
    model = train_with_everything_but(i);
    test_with(model, i);
}
Tenets of Machine Learning
Choose appropriate:
- Training experience. Ex: it's good to have about equal numbers of cases of each class, even if some classes are more probable in real data. Think about how you'll test, too!
- Target function. Ex: decision tree? Neural net?
- Representation. Ex: how much data? Windy in {true, false} vs. wind_speed in mph.
- Learning algorithm. Ex: greedy search? Genetic algorithm? Back-propagation?
Our Tenets of Scientific Discovery
1. Play to computers' strengths:
   1. Speed
   2. Accuracy (fingers crossed)
   3. They don't get bored
   So do exhaustive search!
   Q: Hey, doesn't that ignore all that AI heuristic-function research?
2. Use background knowledge:
   - Predictive accuracy is not everything!
   - Normal science ==> dominant paradigm
   - Revolutionary science ==> ?
What are the Differences?
1. Background knowledge CSD values background knowledge
ML considers background knowledge
What are the Differences? (cont)
2. The process of knowledge discovery
The ML process is iterative; the CSD process is iterative too, but it starts all over again.
1. Exhaustive Search
Tell computers to consider everything! Search the space systematically, from the simplest models to increasingly more complex ones.
Issues:
1. How do you search systematically?
   - States: models. Initial state = simplest model; goal state = solution model.
   - Operators: go from one model to a marginally more complex one.
2. What is "everything"?
   Q: With floating-point values, every different coefficient could be a new model (x, x+dx, x+2dx, etc.)
   A: Generate the next qualitative state, then use numerical methods to find the best coefficients in that state.
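This split between qualitative structure search and numerical coefficient fitting can be sketched as follows: enumerate one-term model forms y = c*f(x) from simplest upward, and for each form let closed-form least squares choose the coefficient. The candidate forms and the data are my illustrative choices, not from the lecture:

```python
# Qualitative states: model forms y = c*f(x), enumerated simplest first.
candidates = [('c*x',   lambda x: x),
              ('c*x^2', lambda x: x * x),
              ('c*x^3', lambda x: x ** 3)]

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 7.9, 18.2, 31.8]          # roughly y = 2*x^2 with noise

def fit(f):
    """Numerical step: best c for y ~ c*f(x) by least squares,
    c = sum(f(x)*y) / sum(f(x)^2)."""
    num = sum(f(x) * y for x, y in zip(xs, ys))
    den = sum(f(x) ** 2 for x in xs)
    return num / den

# Search step: keep the qualitative form whose fitted model has least error.
best = min(candidates,
           key=lambda nf: sum((y - fit(nf[1]) * nf[1](x)) ** 2
                              for x, y in zip(xs, ys)))
print(best[0])  # -> 'c*x^2'
```

The search never steps through coefficient values (x, x+dx, x+2dx, ...); it only steps through qualitative forms, and the numerics fill in the best coefficient for each.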
2. Background knowledge as inductive bias (1)
Inductive bias is necessary:
- N training cases, but the N+1st test case could be anything
- We want to assume something about the target function
- Inductive bias = what you've assumed
Common inductive biases in ML:
- Minimal cross-validation error (e.g. decision tree learning)
- Maximal conditional independence (Bayes nets)
- Maximal boundary size between classes (support vector machines)
- Minimal description length (Occam's razor)
- Minimal feature usage (ignore extraneous data)
- Same class as nearest neighbor (locality)
2. Background knowledge as inductive bias (2)
Biases we can add/refine in CSD:
1. Is it expressible in the same language as the paradigm? Re-use paradigm elements instead of inventing something "brand new":
   - Penalty for new objects
   - Penalty for new attributes
   - Penalty for new processes
   - Penalty for new relations/operations (?)
   - Penalty for new types of assertions (?)
2. Does it use the same reasoning as done in the paradigm?
   - Penalty for new types of reasoning, even with old assertions
Q: Does this mean we can never introduce a new thing?
Penalty for new objects: polywater
Polymer: a long molecule in a repetitive chain.
Nikolai Fedyakin (1962, USSR): condensed H2O in, and forced it through, narrow quartz capillary tubes. Measured its boiling point, freezing point and viscosity; it was similar to syrup.
Boris Derjaguin popularized the results (Moscow, then UK, 1966).
In the West, some could replicate the findings; some could not.
Penalty for new objects: polywater (2)
People were concerned about contamination of the H2O, but precautions were taken against this.
Denis Rousseau (Bell Labs) ran the same tests with his sweat; it had the same properties as "polywater".
It is easier to believe in an old thing (water + organic pollutants) than a new thing ("polywater").
Penalty for new things: Piltdown Man
Circa 1900: looking for early human fossils. Neanderthals in Germany (1863); Cro-Magnon in France (1868). What about England??
Charles Dawson (1912): "I was given a skull by workmen at the Piltdown gravel pit." Later, he got skull fragments and a lower jaw.
[Photo: excavating the Piltdown gravels; Dawson (right), Smith Woodward (center)]
Penalty for new things: Piltdown Man (2)
Royal College of Surgeons (soon after discovery): "Brain looks like modern man's."
French paleontologist Marcellin Boule (1915): "Jaw from an ape."
American zoologist Gerrit Smith Miller (1915): "Jaw from a fossil ape."
German anatomist Franz Weidenreich (1923): "Modern human cranium + orangutan jaw with filed teeth."
Oxford anthropologist Kenneth Page Oakley (1953): "Skull is medieval human, lower jaw is Sarawak orangutan, teeth are fossil chimpanzee."
Penalty for new attributes:
Inertial vs. gravitational mass
Inertial mass: resistance to motion; the m in F = ma.
Active gravitational mass: ability to attract other masses; the M in F = GMm/r^2.
Passive gravitational mass: ability to be attracted by other masses; the m in F = GMm/r^2.
Penalty for new attributes (2)
Conceptually these are three different types of mass, yet no experiment has ever distinguished between them, though people from Newton on have tried. So assume they are all the same!
Penalty for new processes: cold-fusion
Cold fusion: a novel combination of old processes, catalysis + fusion.
Catalysis:
Hard: A + B -> D
Easier (C = catalyst):
A + C -> AC (activated catalyst)
B + AC -> ABC (ready to go)
ABC -> CD (easier reaction)
CD -> C + D (catalyst ready to do another reaction)
Penalty for new processes: cold-fusion (2)
Fusion: how it works.
- Get lots of energy by fusing neutron-rich atoms
- But you need a lot of energy in to get more out
Penalty for new processes: cold-fusion (3)
Fusion: overcoming the electrostatic force is hard. With current technology you need a fission bomb to do it.
Penalty for new processes: cold-fusion (4)
Martin Fleischmann & Stanley Pons (1989): "We can do fusion at room temperature!" (No initiating nuclear bomb needed.)
Electrolysis of heavy water (D2O); "excess heat" observed.
Proposed mechanism (palladium is the catalyst):
Pd + D -> Pd-D
Pd-D + D -> D-Pd-D
D-Pd-D -> He-Pd + energy!
He-Pd -> He + Pd
Penalty for new processes: cold-fusion (5)
Reported in the New York Times; instantly a worldwide story among scientists.
Replication: some can, others can't.
Results:
- Energy: some get excess energy; others claim they didn't calibrate/account for everything.
- Helium: not enough observed for the energy said to be produced (and there is background helium in the air).
Ramifications
1. Science is conservative: use the current paradigm to guide thinking.
2. Accuracy is not everything: an assertion has to "fit in" with the current model, i.e. be explainable by the model and use the same terms as the model.
ML and CSD?
From ML we can get the idea of learning as model search:
- Training experience
- Target function
- Representation
- Learning algorithm
Extra considerations for CSD:
- Use computers' strengths (speed + accuracy + they don't get bored): simulation + exhaustive search
- Use background knowledge: be downright conservative about introducing new terms
- The process is not just iterative; it never ends