Automated Theory Formation: First Steps in Bioinformatics
![Page 1: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/1.jpg)
Automated Theory Formation: First Steps in Bioinformatics
Simon Colton
Computational Bioinformatics Laboratory
![Page 2: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/2.jpg)
Machine Learning (ML) Questions
Given some background information
  Concepts, hypotheses (axioms)
Given some positive examples
  And some negative examples
Find me an explanation
  Why the positives are positive
  And the negatives are negative
![Page 3: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/3.jpg)
Example: Predictive Toxicology
Given some theory from chemistry
  Structure of molecules, well-known substructures
Given some examples of toxic drugs
  And some examples of non-toxic drugs
Question: Why are the toxic drugs toxic?
![Page 4: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/4.jpg)
Automated Theory Formation (ATF) Questions
Given some background information
  Concepts, hypotheses (axioms)
And some objects of interest
  Numbers, molecules, etc.
Find something interesting
  Interesting things could be: concepts, examples, hypotheses, explanations
![Page 5: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/5.jpg)
ATF Overview
Scientific theories contain (at least):
  Concepts: salt, acid, base
  Hypotheses: acid + base => salt + water
  Explanations: transfer of electrons, dissolving
So, ATF should do (at least):
  Concept formation, conjecture making
  Hypothesis proving and disproving
Also needs to:
  Measure interestingness, present results, etc.
![Page 6: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/6.jpg)
HR Theory Formation System
Developed in maths
  Designed to be a general-purpose system
Concept-based theory formation
  Tries to make a concept
  Makes a conjecture when it can’t make a concept
  Tries to explain conjectures
Conjecture-based theory formation
  Fix faulty conjectures with concept formation
  PhD work of Alison Pease, based on Lakatos
![Page 7: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/7.jpg)
Concept Formation in HR
10 general production rules
  Take in old concepts, produce new concepts
Example (rules applied: Size, Split, Split, Negate, Compose):
  [a,b] : b|a
  Size: [a,n] : n = |{b : b|a}|
  Split: [a] : 2 = |{b : b|a}|
  Split: [a] : 2|a
  Negate: [a] : not 2|a
  Compose: [a] : 2 = |{b : b|a}| & not 2|a (Odd Prime Numbers)
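The odd-prime derivation on this slide can be imitated in a few lines of Python. This is only an illustrative sketch, not HR's implementation: the helper names (`size`, `split_value`, `split_member`, `negate`, `compose`) are invented here, and concepts are modelled as plain Python functions over integers.

```python
def divisors(a):
    """Base concept [a,b] : b|a, given as the set of b for one a."""
    return {b for b in range(1, a + 1) if a % b == 0}

def size(relation):
    """Size: [a,b] -> [a,n], where n counts the b's related to a."""
    return lambda a: len(relation(a))

def split_value(function, value):
    """Split: fix the output of a function, leaving a predicate on a."""
    return lambda a: function(a) == value

def split_member(relation, member):
    """Split: fix the second column of a relation, leaving a predicate on a."""
    return lambda a: member in relation(a)

def negate(predicate):
    """Negate: objects for which the predicate does not hold."""
    return lambda a: not predicate(a)

def compose(p, q):
    """Compose: conjunction of two predicates."""
    return lambda a: p(a) and q(a)

tau = size(divisors)                  # [a,n] : n = |{b : b|a}|
is_prime = split_value(tau, 2)        # [a] : 2 = |{b : b|a}|
is_even = split_member(divisors, 2)   # [a] : 2|a
is_odd = negate(is_even)              # [a] : not 2|a
is_odd_prime = compose(is_prime, is_odd)

odd_primes = [a for a in range(1, 30) if is_odd_prime(a)]
```

Each new concept is built only from older concepts and one rule, which is the point of the slide: no knowledge of primes is coded in directly.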
![Page 8: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/8.jpg)
Conjecture Making
Empirical checks are performed
  After each attempt to invent a new concept
If the concept has no examples
  Makes a non-existence conjecture
If the concept has the same examples as a previous concept
  Makes an equivalence conjecture
If another concept subsumes the concept
  Makes an implication conjecture
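The three empirical checks can be sketched as comparisons between example sets. The function name `make_conjecture`, the tuple it returns and the sample concepts are assumptions made for illustration; HR's own data structures are not described in these slides.

```python
def make_conjecture(new_name, new_examples, known):
    """Compare a new concept's examples with those of known concepts.

    known maps a concept name to a frozenset of its examples.
    Returns (kind, new_name, other_name_or_None).
    """
    new_examples = frozenset(new_examples)
    if not new_examples:
        return ("non-existence", new_name, None)
    for name, examples in known.items():
        if examples == new_examples:
            return ("equivalence", new_name, name)
    for name, examples in known.items():
        if new_examples < examples:  # strictly subsumed by a known concept
            return ("implication", new_name, name)
    return ("new concept", new_name, None)

numbers = range(1, 21)
known = {
    "even": frozenset(n for n in numbers if n % 2 == 0),
    "prime": frozenset({2, 3, 5, 7, 11, 13, 17, 19}),
}

c1 = make_conjecture("even prime above 2", [], known)
c2 = make_conjecture("exactly two divisors", [2, 3, 5, 7, 11, 13, 17, 19], known)
c3 = make_conjecture("multiple of 4", [4, 8, 12, 16, 20], known)
```

These conjectures are only empirically true of the objects at hand (here the numbers 1 to 20), which is why the later slides talk about proving or disproving them.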
![Page 9: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/9.jpg)
Conjecture Extraction
Suppose HR makes the equivalence conjecture:
  P(a) & Q(a) <=> R(a) & S(a)
Extracts:
  P(a) & Q(a) => R(a), P(a) & Q(a) => S(a)
  R(a) & S(a) => P(a), R(a) & S(a) => Q(a)
Tries to extract: P(a) => R(a), Q(a) => R(a), etc.
  Prime implicates (require proving, though)
Important: gets Horn clauses
  Can be expressed in Prolog…
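A small sketch of the extraction step, assuming each side of the equivalence is held as a list of literal strings. The function names are hypothetical, and the candidates with smaller bodies are only generated here; as the slide notes, they would still need proving.

```python
from itertools import combinations

def extract_implications(lhs, rhs):
    """From lhs <=> rhs, return (body, head) pairs with a single-literal head."""
    extracted = []
    for body, heads in ((lhs, rhs), (rhs, lhs)):
        for head in heads:
            extracted.append((tuple(body), head))
    return extracted

def candidate_implicates(body, head):
    """Candidates with a strictly smaller body, e.g. P(a) => R(a)."""
    return [(subset, head)
            for k in range(1, len(body))
            for subset in combinations(body, k)]

def as_prolog(body, head):
    """Write a (body, head) pair as a Horn clause in Prolog syntax."""
    return f"{head} :- {', '.join(body)}."

implications = extract_implications(["p(A)", "q(A)"], ["r(A)", "s(A)"])
clauses = [as_prolog(body, head) for body, head in implications]
smaller = candidate_implicates(("p(A)", "q(A)"), "r(A)")
```

Because every extracted implication has exactly one literal in its head, each one is a Horn clause and can be written directly as a Prolog rule.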
![Page 10: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/10.jpg)
Explanation Generation
In mathematical domains
  HR relies on automated theorem provers
  And model generators to find counterexamples
  E.g., group theory: a*a=a <=> a=id (proved easily)
In biological/chemistry domains
  Possibly: visualisation tools, reaction pathways
![Page 11: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/11.jpg)
Greatest Hits
Please ask me over coffee about:
  Pre-processing constraint problems
  Learning properties of quadratic residues
  Inventing integer sequences
  Puzzle generation
  Adding to the TPTP library
  Setting mathematical tutorial questions…
![Page 12: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/12.jpg)
Long term aim in Bioinformatics
Develop an ATF system similar to HOMER
  But working in biological domains
Biologist provides little background info
  In a format they are happy with
Program provides results
  Intelligent, interesting, not too much,
  And very little rubbish
Automated assistant for biology
![Page 13: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/13.jpg)
Short term aim in Bioinformatics
HR can work with biological data
  Takes input similar to Muggleton’s Progol
Use HR to solve ML problems
  See how bad an idea that is
Use theory formation to improve ML
  Integrate HR and Progol somehow
![Page 14: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/14.jpg)
Naïve Approach to ML Tasks
Give HR the same input as Progol
  Get it to form a theory
Look at the theory
  Extract concepts which do well on the task
  i.e., they look similar to the target concept
Not a goal-based approach
  Bad idea (slow)
![Page 15: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/15.jpg)
Less Naïve Approach
Improve search using “forward look-ahead”
  ICML paper
This has evolved to “reactive search”
  Uses HR’s own Java interpreter
  HR reacts to certain events in theory formation
  Scripts supplied by the user
HR also makes “near-conjectures”
Faster approach, but still fairly slow
![Page 16: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/16.jpg)
Example – Mutagenesis42 Data
Mutagenesis is similar to carcinogenesis
42 drugs supplied with atom-bond details
  Atom type, number & charge, bond type (1-8)
13 are mutagenic (active), 29 are not active
Progol learned this concept (88% accurate)
  active(A) :- bond(A,B,C,2), bond(A,D,B,1), atm(A,D,c,21,E).
[Figure: atom chain (c,21) -1- (?) -2- (?), where ? marks an unspecified atom and the numbers are bond types]
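To make the learned clause easier to read, here is a sketch that checks the same pattern against one molecule in Python. The data layout (a dict of atoms and a set of directed bond triples) and the toy molecule are invented for illustration and are not taken from the Mutagenesis42 data.

```python
def progol_rule_fires(atoms, bonds):
    """Mirror of: active(A) :- bond(A,B,C,2), bond(A,D,B,1), atm(A,D,c,21,E).

    atoms maps an atom id to (element, type, charge).
    bonds is a set of (first_atom, second_atom, bond_type) triples.
    """
    for (b, _c, kind_bc) in bonds:
        if kind_bc != 2:
            continue  # need a type-2 bond leaving atom B
        for (d, b_again, kind_db) in bonds:
            # need a type-1 bond from a carbon of type 21 into the same atom B
            if kind_db == 1 and b_again == b and atoms[d][:2] == ("c", 21):
                return True
    return False

# Hypothetical molecule, invented for illustration only.
toy_atoms = {"d1": ("c", 21, 0.0), "d2": ("n", 38, 0.0), "d3": ("o", 40, 0.0)}
toy_bonds = {("d1", "d2", 1), ("d2", "d3", 2)}

fires = progol_rule_fires(toy_atoms, toy_bonds)
```

The variables B and C in the clause carry no element or type constraints, which is what the ?s in the figure stand for.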
![Page 17: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/17.jpg)
HR’s Results
Using reactive search, four production rules (PRs), 30K steps
HR learned this concept:
  active(A) :- bond(A,B,C,1), atm(B,F,21), bond(A,C,D,E)
Also 88% accurate
But Progol’s answer is “better”
  Because of its higher information content (fewer ?s)
  Biologists sometimes want more information
Is this really a simpler answer?
[Figure: atom chain (?,21) -1- (?) -?- (?)]
![Page 18: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/18.jpg)
But…
HR also made these equivalence conjectures
  And extracted them (+100 more) for us
  atm(B,X,21) <=> atm(B,c,21)
  atm(B,X,38) <=> atm(B,n,38)
  bond(A,B,C,X1) & atm(C,X2,38) <=> bond(A,B,C,1) & atm(C,X3,38)
  bond(A,X1,B,X2) & atm(B,X3,38) <=> bond(A,B,X4,2) & atm(B,X5,38)
We used these to re-write HR’s answer
  By hand, but hope to automate
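A sketch of how part of that re-writing might be automated. It uses only the first two conjectures (atom type 21 implies element c, atom type 38 implies element n); the bond-related conjectures are not handled. The representation of an answer as a list of (element, type) pairs with `None` for a blank is an assumption made here.

```python
# Read off from atm(B,X,21) <=> atm(B,c,21) and atm(B,X,38) <=> atm(B,n,38).
TYPE_TO_ELEMENT = {21: "c", 38: "n"}

def fill_elements(atoms):
    """Fill in a blank element wherever the atom type determines it."""
    filled = []
    for element, atom_type in atoms:
        if element is None and atom_type in TYPE_TO_ELEMENT:
            element = TYPE_TO_ELEMENT[atom_type]
        filled.append((element, atom_type))
    return filled

def blanks(atoms):
    """Number of unspecified slots (the ?s in the figures)."""
    return sum(value is None for atom in atoms for value in atom)

hr_answer = [(None, 21), (None, None), (None, None)]  # (?,21) (?) (?)
rewritten = fill_elements(hr_answer)
```

Each filled blank raises the information content of the answer without changing which molecules it covers, since the conjectures hold for every molecule in the data.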
![Page 19: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/19.jpg)
Giving us this answer:
[Figure: atom chain (c,21) -1- (n,38) -2- (?)]
Remember that Progol’s answer was:
[Figure: atom chain (c,21) -1- (?) -2- (?)]
So, we filled in one of the blanks!
![Page 20: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/20.jpg)
Are we making a meal of this?
Yes, possibly for the mutagenesis data
  I was worried about the difficulty of this problem
In the last week I’ve written a 200-line Prolog program
  Which runs quite fast
  And can be distributed over multiple processors
  And can be easily understood by biologists
  And gets these results…
![Page 21: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/21.jpg)
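Template search – Results
Nice result one (88% accurate, lots of info)
[Figure: substructure; recovered labels: atoms c,21; n,38; o,40; o,40; bond types 1, 2, 2]
Nice result two (95% accurate)
[Figure: substructures; recovered labels: atoms c,21; n,38; o,40 with bond types 1, 2; atoms c,?; c,22; ?; c,195; c,22; h,3; charges -0.132, 0.145; remaining bond labels 17 7 1]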
![Page 22: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/22.jpg)
Template Search - Assumptions
Connected substructures are interesting answers
  Progol’s answers are all substructures
More specific substructures are not so bad
  Biologists may even want lots of information
  Don’t forget that they want to do science
Each learned concept will be true of at least one active (positive) molecule
![Page 23: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/23.jpg)
Template Search - Overview
User chooses a template for substructures
[Figure: template (?,?) -?- (?,?) -?- (?,?)]
User specifies how many ?s are allowed
  E.g., 3 out of 8 in the above template
Algorithm starts with the first positive
  Extracts all substructures in the template
Then takes the next positive; for each substructure in the set:
  Add the LGG (least general generalisation) so that it fits both positives
  Don’t go under the IC (information content) limit
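The loop on this slide can be sketched as follows. This is a guess at the shape of the algorithm, not the 200-line Prolog program itself: substructures are assumed to be 8-slot tuples (element, type, bond, element, type, bond, element, type), extraction of substructures from a molecule is assumed to have been done already, and the IC limit is expressed as a maximum number of ?s.

```python
WILDCARD = "?"

def lgg(s, t):
    """Least general generalisation: keep slots that agree, blank the rest."""
    return tuple(a if a == b else WILDCARD for a, b in zip(s, t))

def unknowns(s):
    """Number of ?s in a substructure (fewer ?s means more information)."""
    return s.count(WILDCARD)

def template_search(positives, max_unknowns):
    """positives: one set of fully specified substructures per positive molecule."""
    found = set(positives[0])
    for substructures in positives[1:]:
        generalised = {lgg(s, t) for s in found for t in substructures}
        # keep only generalisations that stay within the allowed number of ?s
        found |= {g for g in generalised if unknowns(g) <= max_unknowns}
    return found

# Hypothetical substructures, invented for illustration only.
first = {("c", 21, 1, "n", 38, 2, "o", 40)}
second = {("c", 21, 1, "n", 38, 2, "c", 22)}

result = template_search([first, second], max_unknowns=3)
```

Keeping the original substructures alongside each LGG matches the assumption that every learned concept is true of at least one positive molecule.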
![Page 24: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/24.jpg)
Template Search – Final Part
For all the substructures
  Take a disjunction which achieves the best accuracy
Distribution of this algorithm is possible
  We’re getting a big Linux farm
  PPP – Processor Per Positive
  Each processor finds substructures true of one positive; combine answers at the end
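The slides do not say how the best disjunction is chosen. One simple option, shown here purely as an assumption, is a greedy search: repeatedly add the substructure that most improves accuracy and stop when nothing helps. The names `accuracy`, `greedy_disjunction` and the toy coverage table are hypothetical.

```python
def accuracy(covered, positives, negatives):
    """Fraction of molecules classified correctly by 'covered means active'."""
    true_pos = len(covered & positives)
    true_neg = len(negatives - covered)
    return (true_pos + true_neg) / (len(positives) + len(negatives))

def greedy_disjunction(coverage, positives, negatives):
    """coverage maps a substructure name to the set of molecules it matches."""
    chosen, covered = [], set()
    best = accuracy(covered, positives, negatives)
    while True:
        candidates = [
            (accuracy(covered | molecules, positives, negatives), name)
            for name, molecules in coverage.items()
            if name not in chosen
        ]
        if not candidates:
            break
        score, name = max(candidates)
        if score <= best:
            break  # no remaining substructure improves the disjunction
        chosen.append(name)
        covered |= coverage[name]
        best = score
    return chosen, best

# Hypothetical coverage table, invented for illustration only.
positives = {"m1", "m2", "m3"}
negatives = {"m4", "m5"}
coverage = {"s1": {"m1", "m2"}, "s2": {"m3", "m4"}, "s3": {"m3"}}

chosen, best = greedy_disjunction(coverage, positives, negatives)
```

A greedy search is not guaranteed to find the optimal disjunction; with few substructures an exhaustive search over subsets would also be feasible.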
![Page 25: Automated Theory Formation: First Steps in Bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062301/56813255550346895d98daf1/html5/thumbnails/25.jpg)
Conclusions & Future Work
Automated Theory Formation
  May be useful to bioinformatics
Use HR’s theory to improve Progol’s results
  Possibly by pre-processing Progol’s input
  Or by post-processing the learned concept
Template search
  Maybe a good idea? Possibly not new…
  Not bad results for the Mutagenesis42 dataset