machine learning in bioinformatics simon colton the computational bioinformatics laboratory

35
Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Upload: frederica-hardy

Post on 23-Dec-2015

238 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Machine Learning in Bioinformatics

Simon Colton

The Computational Bioinformatics Laboratory

Page 2: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Talk Overview

Our research group– Aims, people, publications

Machine learning– A balancing act

Bioinformatics– Holy grails

Our bioinformatics research projects– From small to large

A future direction– Integration of reasoning techniques

Page 3: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Computational BioinformaticsLaboratory

Our aim is to:– Study the theory, implementation and application of

computational techniques to problems in biology and medicine Our emphasis is on:

– Machine learning representations, algorithms and applications Our favourite techniques are:

– ILP, SLPs, ATF, ATP, CSP, GAs, SVMs– Kernel methods, Bayes nets, Action Languages

The (major) research tools we’ve produced are:– Progol, HR, MetaLog (in production)

Page 4: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

The Research Group Members Hiroaki Watanabe (RA, BBSRC) Alireza Tamaddoni-Nezhad (RA, DTI) Stephen Muggleton (Professor) Ali Hafiz (PhD) Huma Lodhi (RA, DTI) Simon Colton (Lecturer) Jung-Wook Bang (RA, DTI) (Nicos Angeloupolos, now in York) (RA, BBSRC)

Room 407 – http://www.doc.ic.ac.uk/bioinformatics

Page 5: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Some External Collaborators

Mike Sternberg (Biochemistry, Imperial) Jeremy Nicholson (Biomedical Sciences, Imperial) Steve Oliver (Biology, Manchester) Ross King (Computing, Aberystwyth) Doug Kell (Chemistry, Manchester) Chris Rawlings (Oxagen) Charlie Hodgman (GSK) Alan Bundy (Informatics, Edinburgh) Toby Walsh (Cork Constraint Computation Centre)

Page 6: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Some Departmental Collaborators Krysia Broda, Allesandro Russo, Oliver Ray

– Aspects of ILP and ALP

Marek Sergot– Action Languages

Tony Kakas (Visiting professor, Cyprus)– Abductive Logic Programming

Page 7: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Machine Learning Overview

Ultimately about writing programs which improve with experience– Experience through data– Experience through knowledge– Experience through experimentation (active)

Some common tasks:– Concept learning for prediction– Clustering– Association rule mining

Page 8: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Maintaining a Balance

Predictivetasks

Descriptivetasks

Supervisedlearning

Unsupervisedlearning

Know what you’re looking for

Don’t knowyou’re even

looking

Don’t know what you’re looking for

Page 9: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

A Partial Characterisation of Learning Tasks

Concept learning Outlier/anomaly detection Clustering Concept formation Conjecture making Puzzle generation Theory formation

Page 10: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Maintaining a Balance in Predictive/Descriptive tasks

Predictive tasks– From accuracy to understanding– Need to show statistical significance

• But hypotheses generated often need to be understandable

– Difference between the stock market and biology Descriptive tasks

– From pebbles to pearls– Lots of rubbish produced

• Cannot rely on statistical significance

– Have to worry about notions of interestingness• And provide tools to extract useful information from output

Page 11: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Maintaining a Balance in Scientific Discovery tasks Machine learning researchers

– Are generally not domain scientists also Extremely important to collaborate

– To provide interesting projects• Remembering that we are scientists not IT consultants

– To gain materials• Data, background knowledge, heuristics,

– To assess the value of the output

Page 12: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Inductive Logic Programming

Concept/rule learning technique (usually)– Hypotheses represented as Logic Programs

Search for LPs – From general to specific or vice-versa

• One method is inverse entailment

– Use measures to guide the search• Predictive accuracy and compression (info. theory)

– Search performed within a language bias Produces good accuracy and understanding

– Logic programs are easier to decipher than ANNs Our implementation: Progol (and others)

Page 13: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Example learned LP

fold('Four-helical up-and-down bundle',P) :- helix(P,H1), length(H1,hi), position(P,H1,Pos), interval(1 =< Pos =< 3), adjacent(P,H1,H2), helix(P,H2).

Predicting protein folds from helices

Page 14: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Stochastic Logic Programs

Generalisation of HMMs Probabilistic logic programs

– More expressive language than LPs– Quantative rather than qualitative

• Express arbitrary intervals over probability distributions Issues in learning SLPs

– Structure estimation– Parameter estimation

Applications– More appropriate for biochemical networks

Page 15: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Automated Theory Formation

Descriptive learning technique– Which can also be used for prediction tasks

Cycle of activity– Form concepts, make hypotheses, explain hypotheses,

evaluate concepts, start again,…– 15 production rules for concepts– 7 methods to discover and extract conjectures– Uses third party software to prove/disprove (maths)– 25 heuristic measures of interestingness

Project: see whether this works in bioinformatics Our implementation: HR

Page 16: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Other Machine Learning Methods used in our Group

Genetic algorithms– To perform ILP search (Alireza)

Bayes nets– Introduction of hidden nodes (Philip)

Kernel methods– Relational kernels for SVMs and regression (Huma)

Action Languages– Stochastic (re)actions (Hiraoki)

Page 17: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Bioinformatics Overview “Bioinformatics is the study of information content

and information flow in biological systems and proceses” (Michael Liebman)– Not just storage and analysis of huge DNA sequences

“Bioinformaticians have to be a Jack of all trades and a master of one” (Charlie Hodgman, GSK)

Highly collaborative– biology, mathematics, statistics, computer science,

biochemistry, physics, chemistry, medicine, …

Page 18: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

From Sequence to Structure

MRPQAPGSLVDPNEDELRMAPWYWGRISREEAKSILHGKPDGSFLVRDALSMKGEYTLTLMKDGCEKLIKICHMDRKYGFIETDLFNSVVEMINYYKENSLSMYNKTLDITLSNPIVRAREDEESQPHGDLCLLSNEFIRTCQLLQNLEQNLENKRNSFNAIREELQEKKLHQSVFGNTEKIFRNQIKLNESFMKAPADA……

attcgatcgatcgatcgatcaggcgcgctaCgagcggcgaggacctcatcatcgatcag…

There is a computer program…?

Page 19: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Holy Grail Number One

From protein sequence to protein function HGP data needs to be interpreted

– Genome split into genes, which code for a protein– Biological function of protein dictated by structure

Structure of many proteins already determined– By X-ray crystallography

Best idea so far: given a new gene sequence– Find sequence most similar to it with known structure

• And look at the structure/function of the protein Other alternatives

– Use ML techniques to predict where secondary structures will occur (e.g., hairpins, alpha-helices, beta-sheets)

Page 20: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Holy Grail Number Two

Drug companies lose millions– Developing drugs which turn out to be toxic

Predictive Toxicology– Determine in advance which will be toxic

Approach 1: Mapping molecules to toxicity– Using ML and statistical techniques

Approach 2: – Producing metabolic explanations of toxic effects– Using probabilistic logics to represent pathways

• And learning structures and parameters over this

Page 21: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Other aims of Bioinformatics

Organisation of Data– Cross referencing– Data integration is a massive problem

Analysing data from – High-throughput methods for gene expression– Ask Yike about this!

Produce Ontologies– And get everyone to use them?

Page 22: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Some Current Bioinformatics Projects SGC

– The Substructure Server SGC and SHM

– Discovery in medical ontologies SHM

– Studying biochemical networks (£400k, BBSRC)– Closed loop learning (£200k, EPSRC)– The Metalog project (£1.1 million, DTI)– APRIL 2 (£400k, EC)

Page 23: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

A Substructure Server

Lesson from Automated Theorem Proving– Best (most complex) methods not most used

• Other considerations: ease of use, stability, simplicity, e.g., Otter

Aim: provide a simple predictive toxicology program– Via a server with a very simple interface

Sub-projects– Find substructures in many positives, few negatives: Colton

• Simple Prolog program, writing Java version, use ILP??

– Put program on server: Anandathiyagar (MSc.)– Distribute process over our Linux cluster: Darby (MEng.)– Babel preprocessor (50+ repns), Rasmol back-end: ???

Page 24: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

The Substructure Server

Page 25: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Using Medical Ontologies

Use Ontology and ML for database integration– Muggleton and Tamaddoni-Nezhad – Bridge between two disparate databases

• LIGAND (biochemical reactions)• Enzyme classification system (EC) = ontology

Automated ontology maintenance– Colton and Traganidas (MSc. Last year)– Gene Ontology (big project)– Use data to find links between GO terms

• Equivalence and implication finding using HR

Page 26: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Gene Ontology Discovery

55%

Page 27: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Studying Biochemical Networks

Use SLPs to find mappings between genomes – Map function of pairs of homologous proteins

• E.g., mouse and human– Homology is probabilistic

Developed SLP learning algorithms Initial results applying them in biological networks Work by

– Muggleton, Angeloupolos and Watanabe

Page 28: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Closed Loop Machine Learning

Active learning– Information theoretic algorithm designs and chooses the most

informative and lowest cost experiments to carry out• Implemented in the ASE-Progol system

– Learning generates hypotheses– Being studied by Ali Hafiz (PhD)

Idea: use machine learning to guide experimentation– using a real robot geneticist in a cyclic process

Aims of current project: determine the function of genes Cost savings of 2 to 4 times over alternatives Upcoming Nature article

Page 29: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

APRIL 2

Applications of Probabalistic Relational Induction in Logic

Aim: develop representations and learning algorithms for probabilistic logics

Applications: bioinformatics– Metabolic networks– Phylo-genetics

2 RAs at Imperial (with Mike Sternberg)– Starting in January

Page 30: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

The Metalog Project Overview Aim:

– Modelling disease pathways and predicting toxicity– Gap filling: existing representations correct but incomplete – Predict where the toxin is acting (focus)

Multi-layered problem representation – Meta-network level (Bayes nets) Philip– Network level (SLPs) Huma– Biochemical reaction level (LPs) Alireza– Problog lingua-franca developed

• to represent learned knowledge NMR Data from metabonomics from Jeremy Nicholson KEGG Background knowledge from Mike Sternberg

Page 31: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

The Metalog Project Progress

Year 1 achievements (all objectives achieved) Function predictions from LIGAND Mapping between KEGG and metabolic networks Initial Bayes-net model

– Drawn much interest from experts• Agrees with KEGG, and disagrees in interesting ways• Interaction between metabolytes which are not explained

Year 2– Working towards abductive model for gap filling

Page 32: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Future Directions for Machine Learning in Bioinformatics In-silico modelling of complete organisms

Representation and reasoning at all levels– From patient to the molecule

Probabalistic models– For more complex biological processes

• Such as biochemical pathways

Page 33: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Biochemical Pathways

1/120th of a biochemical network

Page 34: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

Future Directions for My Research Descriptive Induction meets Biology data Most ML bioinformatics projects are predictive

– Very carefully compressed notions of interestingness• Into a single measure: predictive accuracy• Domain scientist not bombarded with a lot of information• A correctly answered question can be highly revealing

Can we push this envelope slightly?– Use descriptive induction (WARMR, CLAUDIEN, HR)

• To tell biologists something they weren’t expecting about the data they have collated

– Have to worry hard about dull output• Need to determine heuristics from domain scientists

Page 35: Machine Learning in Bioinformatics Simon Colton The Computational Bioinformatics Laboratory

More Future Directions

Put “Automated Reasoning” back together again– Essential for scientific discovery

ML, ATP, CSP, etc., all work well individually– Surely work better in combination…

Improve ATP to prove a different theorem?– Make flexible using CSP and ATP

Improve ML by rationalising input concepts?– Use ATF and ATP to find concepts and hypotheses

Improve CSP by introducing additional constraints– Use ATF, ML to find constraints, ATP to prove them