Automated Exploration of Bioinformatics Spaces
Simon Colton
Computational Bioinformatics Laboratory
Purpose of the Talk
To make you aware of another tool which may have some potential for use in the Metalog project
To get feedback on this potential
To briefly describe two other projects
The Substructure Server
Old-style approach to using machine learning (ML) for predictive toxicology
– What do the positives have in common that the negatives do not?
– For chemicals, possibly using ILP is like using a sledgehammer to crack a nut
Substructures are often the answer (e.g., mutagenesis)
– Substructure server looks explicitly for substructures
Vehicle for me to understand ML in predictive toxicology and server-client technology
– May even be of some use one day
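The question on this slide (what do the positives share that the negatives lack?) can be sketched in a few lines of Python. FIND-S is the routine named for the server's Prolog learner; the version below is a simplified set-intersection variant for illustration only, not the server's code. The fragment names and the toy molecules are made up.

```python
def find_s(positives):
    """Most specific hypothesis: the fragments shared by every positive."""
    examples = iter(positives)
    hypothesis = set(next(examples))
    for example in examples:
        hypothesis &= set(example)  # generalise just enough to cover this positive
    return hypothesis


def covers(hypothesis, example):
    """A hypothesis covers a molecule if the molecule has all its fragments."""
    return hypothesis <= set(example)


def discriminates(hypothesis, negatives):
    """True if no negative molecule is covered by the hypothesis."""
    return not any(covers(hypothesis, negative) for negative in negatives)


# Hypothetical toy data: each molecule is described by a set of fragment names.
positives = [
    {"nitro", "benzene_ring", "methyl"},
    {"nitro", "benzene_ring"},
    {"nitro", "benzene_ring", "chloro"},
]
negatives = [
    {"benzene_ring", "methyl"},
    {"chloro"},
]

shared = find_s(positives)
print(sorted(shared), discriminates(shared, negatives))
```

If the shared fragment set also covers a negative, this simple variant has no answer to offer, which is one reason a real system needs a richer search.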
Substructure Server
Development Team
– Simon Colton: Prolog machine learning routine (FIND-S)
– Saravanan Anandathiyagar: server technology
– Laurence Darby: distributing the process over our Linux farm (gives roughly a 5 times speed-up)
– A.N.Other masters student (TBA): front end (Babel), back end (Molgen, etc.)
Old-Style Predictive Toxicology
Reason 1:
– Using only chemistry, attributes etc.; not using biochemical pathways
Reason 2:
– Using predictive machine learning; not using descriptive machine learning
Predictive Induction in Bioinformatics
Interesting problem found
– Interesting from a biochemistry perspective
– Interesting from a computer science perspective
Packaged as prediction/classification
– Turned into positives and negatives
– Much work done to shoe-horn into a prediction task
Reason(s) learned why positives are positive
– Almost guaranteed that any answer found will be interesting, because the problem is interesting
Generating Hypotheses
Predictive machine learning produces hypotheses of the form:
– A → Toxic
– Toxic → C
– B → Toxic
– D → ¬Toxic
– etc.
With any luck, A, B or C will be interesting in their own right
– And enter the biochemistry literature!
But what if…
There was an interesting relationship
– Between a concept and a subset of the positives.
Isn’t this interesting?
Examples:
– A → Toxic & B
– C → ¬Toxic & D & E
Predictive versus Descriptive Learning
Predictive learning
– You know what you are looking for
– You just don’t know what it looks like
Descriptive learning
– You don’t know what you are looking for
– But you want to find something interesting
Eventually:
– You don’t even know you are looking for something
Descriptive Induction
Not as goal directed as predictive induction
Same background information given
– Perhaps no categorisation into pos & neg
A theory is produced which contains:
– Examples
– Concepts which categorise/describe sets of examples
– Hypotheses which relate concepts
– Explanations which explain the hypotheses
For instance:
– Acid + Base → Salt + Water
Tools are supplied so that
– The user can extract interesting parts of the theory
The HR System in 3 Slides
Concept formation
– Starts with background info like Progol
– Builds new concepts from old ones, using one of 15 production rules (composition, instantiation, counting, matching, etc.), each unary or binary; many settings for how concept formation occurs
– Derives examples & definition of concepts
Heuristic search (if user specifies)
– Uses a best first search
– 20+ measures of interestingness for concepts/conjectures
– Chooses to build new concepts from best old ones
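The loop described on this slide (take the best old concept, apply production rules, score the results) might look roughly like the Python sketch below. Everything in it is an assumption made for illustration: a concept is just a name paired with its set of examples, `compose` and `negate` stand in for HR's 15 production rules, and `coverage` is a placeholder for its 20+ interestingness measures.

```python
import heapq


def compose(a, b):
    """Binary rule (hypothetical): objects that satisfy both concepts."""
    (name_a, examples_a), (name_b, examples_b) = a, b
    return (f"({name_a} & {name_b})", examples_a & examples_b)


def negate(a, universe):
    """Unary rule (hypothetical): objects that do not satisfy the concept."""
    name, examples = a
    return (f"not {name}", universe - examples)


def coverage(concept, universe):
    """Placeholder interestingness measure: fraction of objects covered."""
    return len(concept[1]) / len(universe)


def form_concepts(background, universe, steps):
    """Best-first concept formation: always expand the highest-scoring concept."""
    theory = list(background)
    seen = {examples for _, examples in theory}  # skip concepts with identical examples
    agenda = [(-coverage(c, universe), i, c) for i, c in enumerate(theory)]
    heapq.heapify(agenda)
    tick = len(agenda)  # unique tie-breaker so concepts themselves are never compared
    for _ in range(steps):
        if not agenda:
            break
        _, _, best = heapq.heappop(agenda)
        candidates = [negate(best, universe)]
        candidates += [compose(best, other) for other in theory if other is not best]
        for candidate in candidates:
            if candidate[1] not in seen:
                seen.add(candidate[1])
                theory.append(candidate)
                heapq.heappush(agenda, (-coverage(candidate, universe), tick, candidate))
                tick += 1
    return theory


universe = frozenset(range(1, 21))
even = frozenset(n for n in universe if n % 2 == 0)
square = frozenset({1, 4, 9, 16})
theory = form_concepts([("even", even), ("square", square)], universe, steps=2)
for name, examples in theory:
    print(name, sorted(examples))
```

Concepts are deduplicated by their example sets here, so a second concept with the same examples is silently dropped; in a system like HR that coincidence is itself worth recording as a conjecture.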
The HR System in 3 Slides
Conjecture making
– “Proper” induction!
– Notices patterns in examples for concepts
– Newly formed concept has no examples: makes a non-existence conjecture
– Two concepts have exactly the same examples: makes an equivalence conjecture
– One concept’s examples are a subset of another’s: makes an implication conjecture
– Extracts simpler hypotheses from empirical ones
– Able to make “near-conjectures”: patterns don’t have to be exact; user specifies a tolerance level
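The conjecture types on this slide reduce to comparisons between sets of examples. The Python sketch below illustrates that idea only. The function name, the return format, and the way the tolerance is turned into a mismatch fraction are assumptions, not HR's actual definitions.

```python
def make_conjecture(name_a, examples_a, name_b, examples_b, tolerance=0.0):
    """Compare the examples of two concepts and return (kind, left, right) or None."""
    a, b = set(examples_a), set(examples_b)
    if not a:
        return ("non-existence", name_a, None)
    if not b:
        return ("non-existence", name_b, None)
    if a == b:
        return ("equivalence", name_a, name_b)
    if a < b:
        return ("implication", name_a, name_b)
    if b < a:
        return ("implication", name_b, name_a)
    # Near-conjectures: the pattern may fail on a small fraction of examples.
    if len(a ^ b) / len(a | b) <= tolerance:
        return ("near-equivalence", name_a, name_b)
    if len(a - b) / len(a) <= tolerance:
        return ("near-implication", name_a, name_b)
    if len(b - a) / len(b) <= tolerance:
        return ("near-implication", name_b, name_a)
    return None


evens = {2, 4, 6, 8, 10}
multiples_of_4 = {4, 8}
print(make_conjecture("multiple_of_4", multiples_of_4, "even", evens))
```

With the default tolerance of 0.0 only exact patterns are reported; raising it lets through conjectures that hold for most, but not all, examples.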
The HR System in 3 Slides
Generating explanations
– User supplies a set of axioms
– HR appeals to a third party theorem prover and a third party model generator (Otter/MACE)
– To attempt to prove/disprove that the hypothesis follows from the axioms
Sometimes, explanations are interesting
– In domains such as group theory, explanations are proofs of theorems
Sometimes, explanations show that a hypothesis is dull
– Anything provable by the theorem prover is trivial
Extreme(!) Theory Formation
All my best examples are from maths
Given only one concept:
– How to divide two integers
HR finds the conjecture
– Odd refactorable numbers are squares
Invented concepts:
– Odd, square, refactorable, (even, tau, …)
Made concept of odd refactorables
– Noticed the examples are a subset of the examples for square numbers
No proof supplied (I proved this one)
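The conjecture is easy to check empirically. The Python sketch below assumes the usual definitions: tau(n) is the number of positive divisors of n, and n is refactorable when tau(n) divides n. It tests the conjecture up to a bound and proves nothing. The function names are chosen for this illustration.

```python
import math


def tau(n):
    """Number of positive divisors of n."""
    count = 0
    for d in range(1, math.isqrt(n) + 1):
        if n % d == 0:
            count += 1 if d * d == n else 2  # d and n // d, counted once if equal
    return count


def is_refactorable(n):
    """n is refactorable if its number of divisors divides n."""
    return n % tau(n) == 0


def is_square(n):
    root = math.isqrt(n)
    return root * root == n


def odd_refactorables(limit):
    """Odd refactorable numbers from 1 to limit inclusive."""
    return [n for n in range(1, limit + 1, 2) if is_refactorable(n)]


def counterexamples(limit):
    """Odd refactorable numbers up to limit that are not squares."""
    return [n for n in odd_refactorables(limit) if not is_square(n)]


print(odd_refactorables(30))
print(counterexamples(10_000))
```

An empty list of counterexamples is the kind of empirical pattern HR notices: the examples of one concept form a subset of the examples of another.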
What HR Can Deliver
HR generates hypotheses like Progol
– But there are too many
– Require filters to prune dull ones
Some concepts might be interesting aside from their relation to toxicity
HR points out interesting examples
– E.g., a molecule has the only occurrence of a particular sub-molecule
Interesting New Angle
Anomaly detection
First experiments in analysis of Bach chorale melodies
– Which ones were different to the rest
Not necessarily breaking rules
Could be: something occurring more often
– “Parsimony outlier” measure of interestingness
Hope to try this with metabolic pathways
– Give me 30 pathways: I’ll give you reasons why each is unique
– Give me an invented pathway: I’ll show you possible reasons it’s wrong…
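One very simple reading of "reasons why each is unique" is to list, for every object, the features that no other object has. The Python sketch below does only that. It is a stand-in for illustration, not the parsimony outlier measure, and the pathway and enzyme names are made-up toy data.

```python
from collections import Counter


def unique_features(objects):
    """Map each object name to the features that occur in no other object."""
    counts = Counter(
        feature for features in objects.values() for feature in set(features)
    )
    return {
        name: sorted(f for f in set(features) if counts[f] == 1)
        for name, features in objects.items()
    }


# Hypothetical toy data: each pathway is described by the enzymes it uses.
pathways = {
    "pathway_1": {"enzyme_a", "enzyme_b"},
    "pathway_2": {"enzyme_b", "enzyme_c"},
    "pathway_3": {"enzyme_b"},
}

reasons = unique_features(pathways)
for name, features in reasons.items():
    print(name, features)
```

An object with an empty list, like the third pathway here, has no single distinguishing feature, so a reason for its uniqueness would have to come from a combination of concepts instead.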
What I need
Objects of interest
– Pathways
Background concepts
– Ways to describe the pathways
Axioms
– What we know is true about pathways
Measures of interestingness
– Essential to separate the wheat from the chaff
– Evolve over time as we use HR together
Future for my Work
Form theories about biochemical data
Domain of interest
– Pathways
Technical problems
– Enabling HR to work with probabilistic information (not yet possible)
– Enabling HR to work with larger datasets
– Understanding pathways!
The Amaze Database
Bioinformatics MSc. Project
– Organised by Marek Sergot
Challenge
– To resurrect the Amaze database of biochemical pathways (EBI originally, now Université libre de Bruxelles)
– To get hold of data, put into a database, put a front-end onto this, etc.
– And write translation routines so that we can get at the information
This is a resource we should use
– Please let me know your requirements