Posted on 15-Jan-2016
CSC 599: Computational Scientific Discovery
Lecture 4: Machine Learning and Model Search
Outline
- Computational Reasoning in Science, cont'd: Computer Algebra; Bayesian nets
- Brief introduction to Artificial Intelligence: search spaces and search operators; Newell's model of intelligence
- Brief introduction to Machine Learning: error, precision and accuracy; overfitting
- Computational scientific discovery vs. Machine Learning: importance of sticking to the paradigm; CSD vs. ML: the take-home message
Computer Algebra
Q: Have an ungodly amount of algebra to do (physics, engineering)?
A: Forget numbers! Try a Computer Algebra System (CAS)!
- For algebraic symbol manipulation
- Examples: Mathematica, Maple
(Compare: numerical methods and statistics packages, which do "number crunching". Examples: Matlab, Mathematica; SAS, SPSS)
Bayesian Networks
Idea:
- Complexity: lots of variables; non-deterministic environment
- Simplicity: patterns of influence between variables
- A Bayesian net encodes the influence patterns
Example variables:
a) Prof assigns homework? (true or false)
b) TA assigns homework? (true or false)
c) Will your weekend be busy? (true or false)
Bayesian Networks (2)
Example: pr = prof assigns, ta = TA assigns, b = busy

p(pr) = 0.6         p(-pr) = 0.4
p(ta|pr) = 0.1      p(-ta|pr) = 0.9
p(ta|-pr) = 0.9     p(-ta|-pr) = 0.1
p(b|ta,pr) = 0.99   p(-b|ta,pr) = 0.01
p(b|-ta,pr) = 0.8   p(-b|-ta,pr) = 0.2
p(b|ta,-pr) = 0.9   p(-b|ta,-pr) = 0.1
p(b|-ta,-pr) = 0.1  p(-b|-ta,-pr) = 0.9
Bayesian Networks (3)
P(pr=T|b=T) = P(b=T, pr=T) / P(b=T)
            = sum over ta of P(b=T, ta, pr=T) / sum over ta, pr of P(b=T, ta, pr)
            = [(0.99 * 0.1 * 0.6 = 0.0594, TTT) + (0.8 * 0.9 * 0.6 = 0.432, TFT)]
              / [0.0594 (TTT) + 0.432 (TFT) + 0.324 (TTF) + 0.004 (TFF)]
            = 0.4914 / 0.8194
            = 0.599707103
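The arithmetic above can be checked by brute-force enumeration over the joint distribution, which factorizes as P(pr) * P(ta|pr) * P(b|ta,pr). A minimal Python sketch (the variable names and table layout are mine; the probabilities come from the slide):

```python
# Brute-force inference in the prof/TA/busy network.
p_pr = {True: 0.6, False: 0.4}
p_ta_given_pr = {True: {True: 0.1, False: 0.9},    # p_ta_given_pr[pr][ta]
                 False: {True: 0.9, False: 0.1}}
p_b_given = {(True, True): 0.99, (False, True): 0.8,   # (ta, pr) -> p(b=T)
             (True, False): 0.9, (False, False): 0.1}

def joint(pr, ta, b):
    """P(pr, ta, b) via the chain rule the network encodes."""
    p = p_pr[pr] * p_ta_given_pr[pr][ta]
    pb = p_b_given[(ta, pr)]
    return p * (pb if b else 1 - pb)

# P(pr=T | b=T) = sum_ta P(b=T, ta, pr=T) / sum_{ta,pr} P(b=T, ta, pr)
num = sum(joint(True, ta, True) for ta in (True, False))
den = sum(joint(pr, ta, True) for pr in (True, False) for ta in (True, False))
print(num / den)  # ~0.5997, matching the slide
```

Enumeration is exponential in the number of variables, which is exactly why the d-separation shortcuts on the next slide matter.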
Bayesian Networks (4)
Q: That's a lot of work! Can't we get the network to simplify things?
A: Yes, d-separation! Two sets of nodes X, Y are d-separated given Z if every path between them is blocked by some middle node M:
1. M is in Z and sits in a chain: i --> M --> j. Intuition: if I know M, knowing i doesn't tell me any more about j.
2. M is in Z and sits in a fork: i <-- M --> j. Intuition: if I know the common cause M, knowing the 1st result i doesn't tell me any more about the 2nd result j.
3. M is NOT in Z (and neither is any of its descendants) and sits in a collider: i --> M <-- j. Intuition: if I did know i and the common result M, then i would explain M away and justify not believing in j; with M unobserved, i and j tell us nothing about each other.
An A.I. researcher's worldview
Problems are divided into:
1. Those solvable by "algorithms". Algorithm = do these steps and you are guaranteed to get the answer in a "reasonable" time. Classic examples: searching and sorting.
2. Those that aren't. No way to guarantee you will get an answer (in polynomial time). Q: What do you do? A: Search for one!
A.I. Worldview (2)
Example of an "A.I." problem: chess.
- Can you guarantee that you will always win at chess?
- Can you guarantee that you will (at least) never lose?
- No? Well, that makes it interesting!
Compare with Tic-Tac-Toe: you can guarantee that you will never lose. (That's why only children play it.)
A.I. Worldview (3)
A.I. paradigm for searching for a solution. Remember: there is no "algorithm" for obtaining the answer; we need to search for one.
States: configurations of the world.
Operators: define legal transitions from one state to another. Example: white knight g1->f3; white pawn c2->c4.
A.I. Worldview (4)
State space (or search space): the space of states reachable by operators from the initial state.
A.I. Worldview (5)
Goal state: one or more states that have the configuration that you want. In chess: checkmate!
A.I. Worldview (6)
A.I. pioneer Allen Newell's view of intelligence: a given level of "intelligence" is achievable with
a) Lots of knowledge and little search (chess grandmaster)
b) Little knowledge and lots of search ("stupid" program)
c) Some knowledge and some search ("smart" program)
A.I. Worldview (7)
Idea:
1. Start at the initial state
2. Apply operators to traverse the search space
3. Hope to arrive at a goal state
Issues:
- How quickly can you find the answer? (time!)
- How much memory do you need? (space!)
- How good is your goal state? Optimal = shortest path? Optimal = least total arc cost?
A.I. Worldview (8)
Tools:
Uninformed search:
- Depth 1st
- Breadth 1st
- Uniform cost (best 1st where best = least cost so far)
- Iterative deepening depth 1st
Informed search (a heuristic function tells the "desirability" of each node):
- Greedy (best 1st where best = least estimated cost to goal)
- A* (best 1st where best = uniform + greedy)
Search from:
- Initial state to goal state(s)
- Goal state to initial state(s)
- Both directions
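As an illustration of uninformed search, breadth-first search fits in a few lines. The toy state graph below is my own stand-in for "states and operators"; it is not from the lecture:

```python
from collections import deque

def bfs(start, goal, successors):
    """Breadth-first search: returns a path with the fewest operator
    applications from start to goal, or None if the goal is unreachable."""
    frontier = deque([[start]])   # queue of partial paths
    visited = {start}
    while frontier:
        path = frontier.popleft()
        state = path[-1]
        if state == goal:
            return path
        for nxt in successors(state):      # apply every legal operator
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None

# Toy state space: states are letters, operators are the directed edges.
edges = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D', 'E'], 'D': ['F'], 'E': ['F']}
print(bfs('A', 'F', lambda s: edges.get(s, [])))  # -> ['A', 'B', 'D', 'F']
```

Swapping the FIFO queue for a stack gives depth-first search; ordering the frontier by accumulated cost gives uniform-cost search.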
Machine Learning and A.I.
ML goals: find some data structure that permits better performance on some set of problems:
- Prediction
- Conciseness
- Some combination thereof
What about coefficient-finding numerical methods? They're "algorithms" (in the A.I. sense)!
1. Stuff in the data
2. Turn the crank
3. O(n^3) later, out comes the answer
ML example Decision Tree learning
Task: build a decision tree that predicts a class.
- Leaves = guessed class
- Non-leaf nodes = tests on attribute variables
- Each edge to a child represents one or more attribute values
ML example Decision Tree learning (2)
Approach: greedy search.
1. Use information theory to find the best attribute to split the data on
2. Split the data on that attribute
3. Recursively continue until either:
   a) No more attributes to split on (label with the majority class)
   b) All instances are in the same class (label with that class)
ML example Decision Tree learning (3)
A bit of information theory:
- C_i = some class value to guess
- S = some set of examples
- freq(C_i, S) = how many C_i's are in S
- size(S) = size of S
Intuition: with k choices C_1 .. C_k, how much information is needed to specify one C_i from S?
- Not many C_i's (freq ≈ 0)? On average, few bits: each occurrence costs more than 1 bit, but there are not many occurrences.
- Lots of C_i's (freq ≈ size(S))? Not many bits: each occurrence costs less than 1 bit (good default guess).
- Some C_i's (freq ≈ size(S)/2)? About 1 bit: about 1 bit each, occurring about half the time.
ML example Decision Tree learning (4)
Probability of choosing one class value from the set: freq(C_i, S) / size(S)
Information to specify one C_i in S: -lg( freq(C_i, S) / size(S) ) bits
For the expected information, multiply by the class proportions:
info(S) = - sum(i=1 to k): freq(C_i, S)/size(S) * lg( freq(C_i, S)/size(S) )
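The info(S) formula translates directly into code. A small sketch (the function name and counts-based interface are my choices, not from the lecture):

```python
from math import log2

def info(class_counts):
    """info(S): expected bits to specify one class value, given the
    frequency of each class in S. The 0 * lg(0) terms are taken as 0,
    so zero-count classes are simply skipped."""
    total = sum(class_counts)
    return sum(-(c / total) * log2(c / total)
               for c in class_counts if c > 0)

print(info([10, 0]))           # all one class       -> 0.0 bits
print(info([5, 5]))            # perfect 50-50 split -> 1.0 bits
print(round(info([9, 5]), 3))  # 9 yes / 5 no        -> 0.94 bits
```

The first two calls reproduce the two worked cases on the following slides; the third is the class split of the tennis data used later.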
ML example Decision Tree learning (5)
Let's get an intuition.
Case 1: every member of S is a C_1, none is a C_2.
size(S) = 10, freq(C_1, S) = 10, freq(C_2, S) = 0. Therefore:
info(S) = - sum(i=1,2): freq(C_i, S)/size(S) * lg( freq(C_i, S)/size(S) )
        = - [(10/10) * lg(10/10)] - [(0/10) * lg(0/10)]
        = -0 - 0 = 0   (taking 0 * lg(0) = 0)
Intuition: "If we know that we're dealing with S, then we know that all of its members are in C_1. No need to specify which is C_1 and which is C_2."
ML example Decision Tree learning (6)
Let's get an intuition (cont'd).
Case 2: half the members of S are C_1, half are C_2.
size(S) = 10, freq(C_1, S) = 5, freq(C_2, S) = 5. Therefore:
info(S) = - sum(i=1,2): freq(C_i, S)/size(S) * lg( freq(C_i, S)/size(S) )
        = - [(5/10) * lg(5/10)] - [(5/10) * lg(5/10)]
        = -2 * (0.5 * -1) = 1
Intuition: "If we know that we're dealing with S, then it's a 50-50 guess which members belong to C_1 and which to C_2. We need to specify which (no compression possible)."
ML example Decision Tree learning (7)
Recall the plan: select the "best" attribute to partition on, where "best" = best separates the classes.
Information gain for some attribute:
gain(attr) = (avg. info needed to specify a class) - (avg. info needed to specify a class after partitioning by attr)
           = info(T) - info_attr(T)
When info_attr(T) is small, the classes are well separated (big gain!), where:
- n = number of attribute values
- T_i = the subset whose members all have the same attribute value v_i
- info_attr(T) = sum(i=1 to n): size(T_i)/size(T) * info(T_i)
ML example Decision Tree learning (8)
Example data (should we play tennis?)

Outlook   Temp  Humidity  Windy  PlayTennis?
sunny      75      70     true      yes
sunny      80      90     true      no
sunny      85      85     false     no
sunny      72      95     false     no
sunny      69      70     false     yes
overcast   72      90     true      yes
overcast   83      78     false     yes
overcast   64      65     true      yes
overcast   81      75     false     yes
rain       71      80     true      no
rain       65      70     true      no
rain       75      80     false     yes
rain       68      80     false     yes
rain       70      96     false     yes
ML example Decision Tree learning (9)
info(PlayTennis) = -9/14 * lg(9/14) - 5/14 * lg(5/14) = 0.940 bits
info_outlook(PlayTennis) = 5/14 * (-2/5 * lg(2/5) - 3/5 * lg(3/5))
                         + 4/14 * (-4/4 * lg(4/4) - 0/4 * lg(0/4))
                         + 5/14 * (-3/5 * lg(3/5) - 2/5 * lg(2/5))
                         = 0.694 bits
gain(outlook) = 0.940 - 0.694 = 0.246 bits
ML example Decision Tree learning (10)
info(PlayTennis) = -9/14 * lg(9/14) - 5/14 * lg(5/14) = 0.940 bits
info_windy(PlayTennis) = 6/14 * (-3/6 * lg(3/6) - 3/6 * lg(3/6))
                       + 8/14 * (-6/8 * lg(6/8) - 2/8 * lg(2/8))
                       = 0.892 bits
gain(windy) = 0.940 - 0.892 = 0.048 bits
gain(outlook) > gain(windy), so test on outlook!
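Both gains can be reproduced by coding info and info_attr over the 14 tennis examples. A sketch, with the dataset hard-coded to match the counts the slide uses (outlook: 5 sunny with 2 yes, 4 overcast all yes, 5 rain with 3 yes; windy: 6 true with 3 yes, 8 false with 6 yes). Values below are unrounded, so they differ slightly from the slide's two-decimal arithmetic:

```python
from math import log2

# (outlook, windy, play?) for the 14 tennis examples
data = [('sunny', True, 'yes'), ('sunny', True, 'no'), ('sunny', False, 'no'),
        ('sunny', False, 'no'), ('sunny', False, 'yes'),
        ('overcast', True, 'yes'), ('overcast', False, 'yes'),
        ('overcast', True, 'yes'), ('overcast', False, 'yes'),
        ('rain', True, 'no'), ('rain', True, 'no'), ('rain', False, 'yes'),
        ('rain', False, 'yes'), ('rain', False, 'yes')]

def info(examples):
    """Expected bits to specify the class of one example."""
    classes = [c for *_, c in examples]
    ps = [classes.count(v) / len(classes) for v in set(classes)]
    return sum(-p * log2(p) for p in ps if p > 0)

def gain(examples, attr):
    """info(T) - info_attr(T) for attribute index attr (0=outlook, 1=windy)."""
    values = {e[attr] for e in examples}
    info_attr = sum(len(sub) / len(examples) * info(sub)
                    for v in values
                    for sub in [[e for e in examples if e[attr] == v]])
    return info(examples) - info_attr

print(round(gain(data, 0), 3))  # gain(outlook) -> 0.247 (slide: 0.940 - 0.694 = 0.246)
print(round(gain(data, 1), 3))  # gain(windy)   -> 0.048
```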
ML example Decision Tree learning (11)
Guarding against overfitting: cross-validation.
We want to use all the data, but using test data to train is cheating. So split the data into k sets:
for (i = 0; i < k; i++) {
    model = train_with_everything_but(i);
    test_with(model, i);
}
Tenets of Machine Learning
Choose appropriate:
- Training experience. Ex: it's good to have about equal numbers of cases of each class, even if some classes are more probable in real data. Think about how you'll test, too!
- Target function. Ex: decision tree? Neural net?
- Representation. Ex: how much data? Windy in {true, false} vs. wind_speed in mph.
- Learning algorithm. Ex: greedy search? Genetic algorithm? Back-propagation?
Our Tenets of Scientific Discovery
1. Play to computers' strengths:
   1. Speed
   2. Accuracy (fingers crossed)
   3. They don't get bored
   So do exhaustive search!
   Q: Hey, doesn't that ignore all that AI heuristic-function research?
2. Use background knowledge:
   - Predictive accuracy is not everything!
   - Normal science ==> dominant paradigm
   - Revolutionary science ==> ?
What are the Differences?
1. Background knowledge CSD values background knowledge
ML considers background knowledge
What are the Differences? (cont)
2. The process of knowledge discovery
The ML process is iterative; the CSD process is iterative too, but it starts all over again.
1. Exhaustive Search
Tell computers to consider everything! Search the space systematically, from the simplest models to increasingly more complex ones.
Issues:
1. How do you search systematically?
   - States: models. Initial state = simplest model; goal state = solution model.
   - Operators: go from one model to a marginally more complex one.
2. What is "everything"?
   Q: With floating-point values, every different coefficient could be a new model (x, x+dx, x+2dx, etc.)
   A: Generate the next qualitative state, then use numerical methods to find the best coefficients in that state.
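This split between qualitative structure search and numerical coefficient fitting can be sketched as follows: enumerate one-term model forms y = c*f(x) from simplest upward, and for each form let closed-form least squares choose the coefficient. The candidate forms and the data are my illustrative choices, not from the lecture:

```python
# Qualitative states: model forms y = c*f(x), enumerated simplest first.
candidates = [('c*x',   lambda x: x),
              ('c*x^2', lambda x: x * x),
              ('c*x^3', lambda x: x ** 3)]

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 7.9, 18.2, 31.8]          # roughly y = 2*x^2 with noise

def fit(f):
    """Numerical step: best c for y ~ c*f(x) by least squares,
    c = sum(f(x)*y) / sum(f(x)^2)."""
    num = sum(f(x) * y for x, y in zip(xs, ys))
    den = sum(f(x) ** 2 for x in xs)
    return num / den

# Search step: keep the qualitative form whose fitted model has least error.
best = min(candidates,
           key=lambda nf: sum((y - fit(nf[1]) * nf[1](x)) ** 2
                              for x, y in zip(xs, ys)))
print(best[0])  # -> 'c*x^2'
```

The search never steps through coefficient values (x, x+dx, x+2dx, ...); it only steps through qualitative forms, and the numerics fill in the best coefficient for each.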
2. Background knowledge as inductive bias (1)
Inductive bias is necessary:
- N training cases, but the N+1st test case could be anything
- We want to assume something about the target function
- Inductive bias = what you've assumed
Common inductive biases in ML:
- Minimal cross-validation error (e.g. decision tree learning)
- Maximal conditional independence (Bayes nets)
- Maximal boundary size between classes (support vector machines)
- Minimal description length (Occam's razor)
- Minimal feature usage (ignore extraneous data)
- Same class as nearest neighbor (locality)
2. Background knowledge as inductive bias (2)
Biases we can add/refine in CSD:
1. Is it expressible in the same language as the paradigm? Re-use paradigm elements instead of inventing something "brand new":
   - Penalty for new objects
   - Penalty for new attributes
   - Penalty for new processes
   - Penalty for new relations/operations (?)
   - Penalty for new types of assertions (?)
2. Does it use the same reasoning as done in the paradigm?
   - Penalty for new types of reasoning, even with old assertions
Q: Does this mean we can never introduce a new thing?
Penalty for new objects: polywater
Polymer: a long molecule in a repetitive chain.
Nikolai Fedyakin (1962, USSR): condensed H2O in, and forced it through, narrow quartz capillary tubes. Measured its boiling point, freezing point and viscosity; it was similar to syrup.
Boris Derjaguin popularized the results (Moscow, then UK, 1966).
In the West, some could replicate the findings; some could not.
Penalty for new objects: polywater (2)
People were concerned about contamination of the H2O, but precautions were taken against this.
Denis Rousseau (Bell Labs) ran the same tests with his sweat; it had the same properties as "polywater".
It is easier to believe in an old thing (water + organic pollutants) than a new thing ("polywater").
Penalty for new things: Piltdown Man
Circa 1900: looking for early human fossils. Neanderthals in Germany (1863); Cro-Magnon in France (1868). What about England??
Charles Dawson (1912): "I was given a skull by workmen at the Piltdown gravel pit." Later, he got skull fragments and a lower jaw.
[Photo: excavating the Piltdown gravels; Dawson (right), Smith Woodward (center)]
Penalty for new things: Piltdown Man (2)
Royal College of Surgeons (soon after discovery): "Brain looks like modern man's."
French paleontologist Marcellin Boule (1915): "Jaw from an ape."
American zoologist Gerrit Smith Miller (1915): "Jaw from a fossil ape."
German anatomist Franz Weidenreich (1923): "Modern human cranium + orangutan jaw with filed teeth."
Oxford anthropologist Kenneth Page Oakley (1953): "Skull is medieval human, lower jaw is Sarawak orangutan, teeth are fossil chimpanzee."
Penalty for new attributes:
Inertial vs. gravitational mass
Inertial mass: resistance to motion; the m in F = ma.
Active gravitational mass: ability to attract other masses; the M in F = GMm/r^2.
Passive gravitational mass: ability to be attracted by other masses; the m in F = GMm/r^2.
Penalty for new attributes (2)
Conceptually these are three different types of mass, yet no experiment has ever distinguished between them, though people from Newton on have tried. So assume they are all the same!
Penalty for new processes: cold-fusion
Cold fusion: a novel combination of old processes, catalysis + fusion.
Catalysis:
Hard: A + B -> D
Easier (C = catalyst):
A + C -> AC (activated catalyst)
B + AC -> ABC (ready to go)
ABC -> CD (easier reaction)
CD -> C + D (catalyst ready to do another reaction)
Penalty for new processes: cold-fusion (2)
Fusion: how it works.
- Get lots of energy by fusing neutron-rich atoms
- But you need a lot of energy in to get more out
Penalty for new processes: cold-fusion (3)
Fusion: overcoming the electrostatic force is hard. With current technology you need a fission bomb to do it.
Penalty for new processes: cold-fusion (4)
Martin Fleischmann & Stanley Pons (1989): "We can do fusion at room temperature!" (No initiating nuclear bomb needed.)
Electrolysis of heavy water (D2O); "excess heat" observed.
Proposed mechanism (palladium is the catalyst):
Pd + D -> Pd-D
Pd-D + D -> D-Pd-D
D-Pd-D -> He-Pd + energy!
He-Pd -> He + Pd
Penalty for new processes: cold-fusion (5)
Reported in the New York Times; instantly a worldwide story among scientists.
Replication: some can, others can't.
Results:
- Energy: some get excess energy; others claim they didn't calibrate/account for everything.
- Helium: not enough observed for the energy said to be produced (and there is background helium in the air).
Ramifications
1. Science is conservative: use the current paradigm to guide thinking.
2. Accuracy is not everything: an assertion has to "fit in" with the current model, i.e. be explainable by the model and use the same terms as the model.
ML and CSD?
From ML we can get the idea of learning as model search:
- Training experience
- Target function
- Representation
- Learning algorithm
Extra considerations for CSD:
- Use computers' strengths (speed + accuracy + they don't get bored): simulation + exhaustive search
- Use background knowledge: be downright conservative about introducing new terms
- The process is not just iterative; it never ends