A knowledge based approach for representing, reasoning and hypothesizing about biochemical networks Chitta Baral Arizona State University


Page 1: Chitta Baral  Arizona State University

A knowledge based approach for representing, reasoning and hypothesizing about biochemical networks

Chitta Baral, Arizona State University

Page 2: Chitta Baral  Arizona State University

Three parts to the talk

Prediction, Explanation and Planning with respect to biochemical networks

Hypothesis Generation with respect to biochemical networks

Collaborative BioCuration: CBioC

Page 3: Chitta Baral  Arizona State University

Motivation: purpose of interaction databases?

Suppose: We have an almost exhaustive database of the intracellular interactions (protein-protein, metabolic, etc.) of particular cells.
What next? How will we use this database?
What if our knowledge is incomplete?

Page 4: Chitta Baral  Arizona State University

Motivation: Uses of networks & pathways

Visualize the pathways
Analyze the graphs of the networks
Compare graphs of the networks
Use pathway data in conjunction with microarray data analysis
Do system level simulation
Is that all?

Page 5: Chitta Baral  Arizona State University

Motivation: ultimate uses!

Prediction/System Simulation (Systems Biology?)
  Impact of particular perturbations (say caused by a drug that introduces certain proteins to the cell membrane or into the cell)
  Do the perturbations have the desired impact?
  Do they mess up something else? (side effects!)

But that’s not all!

Page 6: Chitta Baral  Arizona State University

Motivation: Explaining observations

A phenotypical observation (leading to) OR an observation that a particular protein or chemical has abnormally high concentration
What is wrong? What is out of the ordinary?
The cause/explanation will give us approaches to fix the problem.
How deep should the explanations go?
How do we compare explanations?

Page 7: Chitta Baral  Arizona State University

Motivation: Designing drugs & therapies

What perturbations (when and where) need to be made so as to make the cell behave in a particular way?
In case of cancer: prevent proliferation, induce apoptosis, prevent migration, etc.

Page 8: Chitta Baral  Arizona State University

What if knowledge is incomplete?

What kind of useful reasoning can we do with incomplete knowledge?
Drug makers don’t wait till full knowledge is available.
Answer: hypothesis formation

Page 9: Chitta Baral  Arizona State University

Motivation: Use summary

The ultimate uses of signaling (metabolic, etc.) interaction databases are to do:
  Prediction: therapy verification; determining side effects.
  Explanation: diagnosing what is wrong.
  Planning: therapy and drug design.
Intermediate or immediate use
  Generate hypotheses

Page 10: Chitta Baral  Arizona State University

Initial goal of our research

Use knowledge representation and reasoning techniques to:
  Represent interactions
  Reason about these interactions: prediction, explanation, planning and hypothesis formation.

Page 11: Chitta Baral  Arizona State University

Some questions

Isn’t it a little premature?
  We know very little about the networks
  New knowledge is being constantly added
Why knowledge representation and reasoning?
  Why not simulation?
  Why not use Petri nets, π-calculus?
Why a knowledge-based approach?
  Why not a database approach? What’s the difference?

Page 12: Chitta Baral  Arizona State University

Our approach: present and future

Yes, prediction is kind-of the same as simulation
  Incompleteness of information is an issue though!
But it is hard to do explanation generation, or design of therapies (planning), using simulation
  Guesses can be verified using simulation though
The core database query languages cannot express explanation or planning queries.
Dealing with incompleteness!

Page 13: Chitta Baral  Arizona State University

Dealing with incompleteness: ongoing and future work

Is one of the key criteria behind a ‘good’ knowledge representation language when building AI systems.
  Needs to be non-monotonic.
  Needs to be elaboration tolerant.
Proper analysis leads to hypothesizing
  If certain observations cannot be satisfactorily explained by the existing knowledge about the network, then use general biological knowledge to hypothesize

Page 14: Chitta Baral  Arizona State University

Motivation -- summary

Goal: To emulate the abstract reasoning done by biologists, medical researchers, and pharmacology researchers.

Types of reasoning: prediction, explanation, planning and hypothesis formation.

Current systems biology approaches: mostly prediction.

Ongoing issues: Dealing with incomplete knowledge and elaboration tolerance.

Page 15: Chitta Baral  Arizona State University

Related Works

Quantitative approaches (hybrid systems, use of differential equations)
Graphical representations
Other qualitative approaches
  Petri nets
  π-calculus
  Pathway Logic
  Model checking

Page 16: Chitta Baral  Arizona State University

Overview of our approach

Represent the signal network as a knowledge base that describes
  actions/events (biological interactions, processes).
  effects of these actions/events.
  triggering conditions of the actions/events.
To query using the knowledge base:
  Prediction; explanation; planning; hypothesis generation
BioSigNet-RR (Biological Signal Network - Representation and Reasoning) and BioSigNet-RRH systems.

Page 17: Chitta Baral  Arizona State University

Foundation behind our approach

Research on representing and reasoning about dynamic systems (space shuttles, mobile robots, software agents)
  causal relations between properties of the world
  effects of actions (when can they be executed)
  goal specification
  action-plans
Research on knowledge representation, reasoning and declarative problem solving: the AnsProlog language.

Page 18: Chitta Baral  Arizona State University

An NFκB signaling pathway

Page 19: Chitta Baral  Arizona State University

An NFκB signaling pathway

Page 20: Chitta Baral  Arizona State University

Syntax by example

bind(TNF-α,TNFR1) causes trimerized(TNFR1)

trimerized(TNFR1) triggers bind(TNFR1,TRADD)

Page 21: Chitta Baral  Arizona State University

General syntax to represent networks

e causes f if f1; …; fk
g1; … ; gk causes g
h1; … ; hm n_triggers e
k1; … ; kl triggers e
r1; … ; rl inhibits e
e is an event (also referred to as an action) and the rest are fluents (properties of the cell)
For metabolic interactions:
  e converts g1; … ; gk to f1; …; fk if h1; … ; hm
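As an illustration only, the event-based statement forms above could be held in a small Python data structure. This is a sketch, not the BioSigNet-RR implementation: the class `Statement`, its fields, and the helper functions are hypothetical names, and the static form (g1; … ; gk causes g) and the converts form are left out.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Statement:
    """One network statement; the class and field names are hypothetical."""
    kind: str                                  # "causes", "triggers", "n_triggers" or "inhibits"
    event: str                                 # the event (action) e
    effect: str = ""                           # fluent f; only used when kind == "causes"
    conditions: FrozenSet[str] = frozenset()   # fluents the statement depends on


def causes(event, effect, *conditions):
    # e causes f if f1; ...; fk
    return Statement("causes", event, effect, frozenset(conditions))


def triggers(event, *conditions):
    # k1; ...; kl triggers e
    return Statement("triggers", event, conditions=frozenset(conditions))


def inhibits(event, *conditions):
    # r1; ...; rl inhibits e
    return Statement("inhibits", event, conditions=frozenset(conditions))


# The two statements from the "Syntax by example" slide.
network = [
    causes("bind(TNF-α,TNFR1)", "trimerized(TNFR1)"),
    triggers("bind(TNFR1,TRADD)", "trimerized(TNFR1)"),
]

for s in network:
    print(s.kind, s.event, sorted(s.conditions))
```

Keeping statements immutable (frozen) makes it possible to put them in sets, which is convenient when a knowledge base is later extended with candidate statements.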

Page 22: Chitta Baral  Arizona State University

Semantics: queries and entailment

Observation part of queries
  f at t
  a occurs_at t
Given the network N and observation O
  Predict if a temporal expression holds.
  Explain a set of observations.
  Plan to achieve a goal.

Page 23: Chitta Baral  Arizona State University

Importance of a formal semantics

Besides defining prediction, explanation and planning, it is also useful in identifying:
  Under what restrictions the answer given by a given (graph based) algorithm will be correct. (soundness!)
  Under what restrictions a given (graph based) algorithm will find a correct answer if one exists. (completeness!)

Page 24: Chitta Baral  Arizona State University

Utility of declarative programming languages (such as AnsProlog)

Allows for quick implementation of the semantics
  The specification or the definition of what is an explanation, or what is a plan, becomes a program that finds explanations and plans respectively.

Page 25: Chitta Baral  Arizona State University

Prediction

Given some initial conditions and observations, to predict how the world would evolve or predict the outcome of (hypothetical) interventions.

Page 26: Chitta Baral  Arizona State University

Back to the example

Binding of TNF-α with TNFR1 leads to TRADD binding with one or more of TRAF2, FADD, RIP.

TRADD binding with TRAF2 leads to over-expression of FLIP provided NIK is phosphorylated on the way.

TRADD binding with RIP inhibits phosphorylation of NIK.

TRADD binding with FADD in the absence of FLIP leads to cell death.

Page 27: Chitta Baral  Arizona State University

Prediction 1

Binding of TNF-α with TNFR1 leads to TRADD binding with one or more of TRAF2, FADD, RIP.

TRADD binding with TRAF2 leads to over-expression of FLIP provided NIK is phosphorylated on the way.

TRADD binding with RIP inhibits phosphorylation of NIK.

TRADD binding with FADD in the absence of FLIP leads to cell death.

Initial condition
  bind(TNF-α,TNF-R1) occurs at t0
Query
  predict eventually apoptosis
Answer
  Unknown! Incomplete knowledge about TRADD’s bindings.
  Depends on whether bind(TRADD, RIP) happened or not!

Page 28: Chitta Baral  Arizona State University

Prediction 2

Binding of TNF-α with TNFR1 leads to TRADD binding with one or more of TRAF2, FADD, RIP.

TRADD binding with TRAF2 leads to over-expression of FLIP provided NIK is phosphorylated on the way.

TRADD binding with RIP inhibits phosphorylation of NIK.

TRADD binding with FADD in the absence of FLIP leads to cell death.

Initial condition
  bind(TNF-α,TNF-R1) occurs at t0
Observation
  TRADD’s binding with TRAF2, FADD, RIP
Query
  predict eventually apoptosis
Answer: Yes!
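The two prediction queries can be mimicked with a small possible-worlds enumeration. The sketch below is a hand-coded Python reading of the four informal rules, not the BioSigNet-RR semantics or its AnsProlog encoding. It assumes that NIK is phosphorylated unless TRADD binding with RIP inhibits it; the names `apoptosis`, `possible_worlds` and `predict_eventually_apoptosis` are hypothetical.

```python
from itertools import combinations

PARTNERS = ("TRAF2", "FADD", "RIP")


def apoptosis(bound):
    """True if cell death follows when TRADD binds the partners in `bound`."""
    # Assumption: NIK gets phosphorylated unless TRADD-RIP binding inhibits it.
    nik_phosphorylated = "RIP" not in bound
    # TRADD-TRAF2 binding over-expresses FLIP, provided NIK is phosphorylated.
    flip = "TRAF2" in bound and nik_phosphorylated
    # TRADD-FADD binding in the absence of FLIP leads to cell death.
    return "FADD" in bound and not flip


def possible_worlds(observed=frozenset()):
    """All non-empty sets of TRADD bindings that contain the observed ones."""
    for size in range(1, len(PARTNERS) + 1):
        for combo in combinations(PARTNERS, size):
            if observed <= set(combo):
                yield frozenset(combo)


def predict_eventually_apoptosis(observed=frozenset()):
    """Answer 'yes' or 'no' only if every consistent world agrees."""
    outcomes = {apoptosis(world) for world in possible_worlds(observed)}
    if outcomes == {True}:
        return "yes"
    if outcomes == {False}:
        return "no"
    return "unknown"


print(predict_eventually_apoptosis())                     # Prediction 1: unknown
print(predict_eventually_apoptosis(frozenset(PARTNERS)))  # Prediction 2: yes
```

With no observation about TRADD's bindings the worlds disagree, so the answer is unknown. Once all three bindings are observed, only one world remains and apoptosis follows in it.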

Page 29: Chitta Baral  Arizona State University

Explanation

Given initial condition and observations, to explain why final outcome does not match expectation.

Page 30: Chitta Baral  Arizona State University

Explanation 1

Binding of TNF-α with TNFR1 leads to TRADD binding with one or more of TRAF2, FADD, RIP.

TRADD binding with TRAF2 leads to over-expression of FLIP provided NIK is phosphorylated on the way.

TRADD binding with RIP inhibits phosphorylation of NIK.

TRADD binding with FADD in the absence of FLIP leads to cell death.

Initial condition:
  bound(TNF-α,TNFR1) at t0
Observation:
  bound(TRADD, TRAF2) at t1
Query: Explain apoptosis
One explanation:
  Binding of TRADD with RIP
  Binding of TRADD with FADD
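The explanation query can be sketched the same way: search for additional TRADD bindings that, together with the observed one, make apoptosis follow. This is an illustrative Python sketch under the same assumption as before (NIK is phosphorylated unless TRADD-RIP binding inhibits it); `explain_apoptosis` is a hypothetical name, not part of the system described in the talk.

```python
from itertools import combinations

PARTNERS = ("TRAF2", "FADD", "RIP")


def apoptosis(bound):
    """True if cell death follows when TRADD binds the partners in `bound`."""
    nik_phosphorylated = "RIP" not in bound          # assumption, see lead-in
    flip = "TRAF2" in bound and nik_phosphorylated   # FLIP over-expression
    return "FADD" in bound and not flip              # FADD without FLIP


def explain_apoptosis(observed):
    """Return every set of unobserved bindings that would yield apoptosis."""
    unobserved = [p for p in PARTNERS if p not in observed]
    explanations = []
    for size in range(len(unobserved) + 1):
        for extra in combinations(unobserved, size):
            world = set(observed) | set(extra)
            if world and apoptosis(world):
                explanations.append(set(extra))
    return explanations


# Observation: bound(TRADD, TRAF2). Which further bindings explain apoptosis?
print(explain_apoptosis({"TRAF2"}))
```

Under these assumptions the only explanation found is that TRADD also bound both FADD and RIP, which agrees with the explanation on the slide.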

Page 31: Chitta Baral  Arizona State University

Planning

Given initial conditions, to plan interventions to achieve a goal.

Application in drug and therapy design.

Page 32: Chitta Baral  Arizona State University

Planning requirements

In addition to the knowledge about the pathway we need additional information about possible interventions such as:
  What proteins can be introduced
  What mutations can be forced

Page 33: Chitta Baral  Arizona State University

Planning example

Defining possible interventions:
  intervention intro(DN-TRAF2)
  intro(DN-TRAF2) causes present(DN-TRAF2)
  present(DN-TRAF2) inhibits bind(TRAF2,TRADD)
  present(DN-TRAF2) inhibits interact(TRAF2,NIK)
Initial condition:
  bound(NFκB,IκB) at 0
  bind(TNF-α,TNF-R1) at 0
Goal: to keep NFκB inactive.
Query:
  plan always bound(NFκB,IκB) from 0

Page 34: Chitta Baral  Arizona State University

Conclusion of part 1

From paper in ISMB 2004:
  Our goal in this paper was to make progress towards developing a system (and the necessary representation language and reasoning algorithms) that can be used to represent signal networks and pathways associated with cells and reason with them.
A start was made.
  Defined a simple language (syntax and semantics)
  Defined prediction, planning and explanation
  A prototype implementation using AnsProlog
  Illustration of its applicability with respect to an NFκB pathway.

Page 35: Chitta Baral  Arizona State University

Issues with incomplete knowledge

Often one may not be able to do much prediction, explanation or planning.
What then?
Can reasoning help in obtaining new knowledge?
Yes, through hypothesis generation!
In fact, hypothesis generation needs reasoning!

Page 36: Chitta Baral  Arizona State University

Part II: Hypothesis Generation

Page 37: Chitta Baral  Arizona State University

Hypothesis generation

Our observations cannot be explained by our existing knowledge OR the explanations given by our existing knowledge are invalidated by experiments.
Conclusion: Our knowledge needs to be augmented or revised. How?
Can we use a reasoning system to predict some hypothesis that one can verify through experimentation?
Automate the reasoning in the mind of a biologist; especially helpful when the background knowledge is humongous.

Page 38: Chitta Baral  Arizona State University

Hypothesis space

[Figure. Labels: Knowledge base; High UV; UV leads_to cancer; p53; Cancer; No cancer; (K,I) |= O]

Page 39: Chitta Baral  Arizona State University

Issues in this tiny example

Hypothesis formation:
  Theory: UV leads to cancer.
  Observation: wild-type p53 resists the UV effect.
  Hypothesis: p53 is a tumor-suppressor.
Elaboration tolerance:
  How do we update/revise “UV leads to cancer”?
Default & non-monotonic (NM) reasoning:
  Normally UV leads to cancer.
  UV does not lead to cancer if p53 is present.

Page 40: Chitta Baral  Arizona State University

Related Works: some prior mention of hypothesis formation

HYPGENE (Karp, 1991)
TRANSGENE (Darden, 1997)
GenePath (Zupan et al., 2003)
Robot Scientist (King et al., 2004)
Database (Doherty et al., 2004)
BIOCHAM (Calzone et al., 2005)
PathLogic (Karp et al., 2002)
Cytoscape (Shannon et al., 2003)
Integrative Scheme (Su et al., 2003)
Pathway Analysis (Ingenuity)

… do not use the latest advances in knowledge representation and reasoning (e.g., they lack ways to express defaults, non-monotonicity, elaboration tolerance, problem solving rules, etc.).

Page 41: Chitta Baral  Arizona State University

Hypothesis formation

Knowledge base: K
Set of initial conditions: I
Set of (experimental) observations: O
(K,I) does not entail O
To expand (K,I) to (K’, I’): (K’, I’) entails O
How to expand (hypothesis space)
  Explanation: expand only I
  Diagnosis: normality assumptions about I, minimally abandon the normality assumptions
  Hypothesis formation: expand K

Page 42: Chitta Baral  Arizona State University

Construction of hypothesis space

Present: manual construction, using research literature
Future: integration of multiple data sources
  Protein interactions
  Pathway databases
  Biological ontologies
  …
Provide cues, hunches such as
  A may interact with B: action interact(A,B)
  A-B interaction may have effect C: interact(A,B) causes C

Page 43: Chitta Baral  Arizona State University

Generation of hypotheses

Enumeration of hypotheses
Search: computing with Smodels (an implementation of AnsProlog)
Heuristics
  A trigger statement is selected only if it is the only cause of some action occurrence that is needed to explain the novel observations.
  An inhibition statement is selected only if it is the only blocker of some triggered action at some time.
  Maximizing preferences of selected statements

Page 44: Chitta Baral  Arizona State University

Generation … (cont.): heuristics

Knowledge base K
  a causes g
  b causes g
Initial condition I = { initially f }
Observation O = { eventually g }
(K,I) does not entail O
Hypothesis space: to expand K with rules among
  f triggers a
  f triggers b
Hypotheses: { f triggers a }, or { f triggers b }
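The toy example above can be reproduced with a brute-force search over subsets of the hypothesis space. The Python sketch below stands in for the Smodels-based search mentioned in the talk and is not that implementation. It keeps only subset-minimal expansions of K, which is one simple reading of the trigger heuristic. All function and variable names are hypothetical.

```python
from itertools import combinations

causes = {"a": "g", "b": "g"}            # K: a causes g, b causes g
initial = {"f"}                          # I: initially f
candidates = [("f", "a"), ("f", "b")]    # hypothesis space: f triggers a, f triggers b


def entails_eventually(goal, triggers):
    """Forward-chain: a triggered action occurs and its effect becomes true."""
    state = set(initial)
    changed = True
    while changed:
        changed = False
        for fluent, action in triggers:
            if fluent in state and causes[action] not in state:
                state.add(causes[action])
                changed = True
    return goal in state


def minimal_hypotheses(goal):
    """Smallest sets of candidate statements whose addition explains the goal."""
    found = []
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            # Skip supersets of an already accepted hypothesis.
            if any(set(h) <= set(subset) for h in found):
                continue
            if entails_eventually(goal, subset):
                found.append(subset)
    return found


for hypothesis in minimal_hypotheses("g"):
    print([f"{fluent} triggers {action}" for fluent, action in hypothesis])
```

The empty expansion fails because nothing triggers a or b, each single trigger statement suffices, and the two-statement expansion is discarded as non-minimal.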

Page 45: Chitta Baral  Arizona State University

Case study: p53 network

Page 46: Chitta Baral  Arizona State University

Tumor suppression by p53

p53 has 3 main functional domains
  N terminal transactivator domain
  Central DNA-binding domain
  C terminal domain that recognizes DNA damage

Appropriate binding of N terminal activates pathways that lead to protection of cell from cancer.

Inappropriate binding (say to Mdm2) inhibits p53 induced tumor suppression.

Page 47: Chitta Baral  Arizona State University

p53 knowledge base

Stress
  high(UV) triggers upregulate(mRNA(p53))
Upregulation of p53
  upregulate(mRNA(p53)) causes high(mRNA(p53))
  high(mRNA(p53)) triggers translate(p53)
  translate(p53) causes high(p53)
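To show how the trigger and causes statements chain together, here is a minimal forward simulation in Python. It is a sketch under simplifying assumptions that are not taken from the talk: every triggered action occurs at the next step, effects persist, and inhibition is ignored. The function name `simulate` is hypothetical.

```python
# (conditions, action) pairs: the action occurs once all conditions hold.
triggers = [
    ({"high(UV)"}, "upregulate(mRNA(p53))"),
    ({"high(mRNA(p53))"}, "translate(p53)"),
]

# action -> fluent that becomes true after the action occurs.
causes = {
    "upregulate(mRNA(p53))": "high(mRNA(p53))",
    "translate(p53)": "high(p53)",
}


def simulate(initial, max_steps=10):
    """Return the list of states reached, one per time step."""
    state = set(initial)
    history = [frozenset(state)]
    for _ in range(max_steps):
        fired = [action for conditions, action in triggers if conditions <= state]
        new_state = state | {causes[action] for action in fired}
        if new_state == state:      # fixed point: nothing new happens
            break
        state = new_state
        history.append(frozenset(state))
    return history


for t, fluents in enumerate(simulate({"high(UV)"})):
    print(t, sorted(fluents))
```

Starting from high(UV), the mRNA level rises at step 1 and high(p53) holds from step 2 on.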

Page 48: Chitta Baral  Arizona State University

p53 knowledge base (cont.)

Tumor suppression by p53
  high(p53) inhibits growth(tumor)

Page 49: Chitta Baral  Arizona State University

p53 knowledge base (cont.)

Interaction between Mdm2 and p53
  high(p53), high(mdm2) triggers bind(p53,mdm2)
  bind(p53,mdm2) causes bound(dom(p53,N))
  bind(p53,mdm2) causes high([p53 : mdm2])
  bind(p53,mdm2) causes ¬high(p53), ¬high(mdm2)

Page 50: Chitta Baral  Arizona State University

Hypothesis formation

Experimental observation:
  I = { initially high(UV), high(mdm2), high(ARF) }
  O = { eventually ~ tumorous }
(K,I) does not entail O
Need to hypothesize the role of ARF.

Page 51: Chitta Baral  Arizona State University

Constructing hypothesis space

Levels of ARF and p53 correlate
  high(ARF) triggers upregulate(mRNA(p53))
  high(p53) triggers upregulate(mRNA(ARF))

Page 52: Chitta Baral  Arizona State University

Constructing … (cont.)

Interactions of ARF with the known proteins
  bind(p53,ARF) causes bound(dom(p53,N))

Page 53: Chitta Baral  Arizona State University

Constructing … (cont.)

Influence of X (=ARF) on other interactions
  high(ARF) triggers upreg(mRNA(p53))
  high(ARF) triggers translate(p53)
  high(ARF) triggers bind(p53,mdm2)

Page 54: Chitta Baral  Arizona State University

Twelve generated hypotheses, such as

high(UV) triggers upregulate(mRNA(ARF))
high(ARF), high(mdm2) triggers bind(ARF,mdm2)

Page 55: Chitta Baral  Arizona State University

Conclusion of part 2

Goal: Automation of hypothesis formation (with respect to interactions and pathways)
Approach:
  Viewed known qualitative aspects of cell activities as a knowledge base
  Used a knowledge representation language that
    Can express defaults
    Allows reasoning with incomplete knowledge
    Can express reasoning as well as problem solving rules
  Developed a system BioSigNet-RRH: formalizing and reasoning about hypotheses
Illustration: Hypothesizing the role of ARF protein in the p53 network.

Page 56: Chitta Baral  Arizona State University

Future Work on Reasoning about Biochemical Networks (Part I and II)

Further development of the language
Validation with respect to larger networks
  Kohn’s map
  Networks in Reactome and other repositories
Going from prototype to deployable systems
  Scaling up challenges
  Recent advances in automatic planning
Integration with BioPAX

Page 57: Chitta Baral  Arizona State University

Part III: CBioC

http://cbioc.org

Page 58: Chitta Baral  Arizona State University

Do we have enough knowledge in the various databases?

Some have been curated into databases.
But there is much more in the literature.
So what do we do?

Page 59: Chitta Baral  Arizona State University

Current status of curation from text

About 15 million abstracts in PubMed
3 million published by US and EU researchers during 1994-2004 (800 articles per day)
300 K articles published so far reporting protein-protein interactions in human, yeast and mouse.
  BIND (in 7 yrs): 23K; DIP: 3K; MINT: 2.4K.

Page 60: Chitta Baral  Arizona State University

Premise: High cost of human curation

Overwhelming cost of large curation efforts may be unsustainable for long periods
BIND: Nov 2005 bad news.
  Operated for 7 years
  Listed over 100 curators & programmers
  CND $29 million received in 2003, plus other funding
Curation efforts of AFCS have recently stopped.
Lack of funding for some genome annotation projects.

Page 61: Chitta Baral  Arizona State University

Premise: summary

Human curation of text is expensive. Human curation of text is not scalable. Human curation of text is not sustainable.

Page 62: Chitta Baral  Arizona State University

Why not resort to computers? Do automatic extraction

Lessons from DARPA funded MUCs (message understanding conferences) in the 90s, run for a decade and at the cost of tens of millions of dollars.
  Getting to 60% recall and precision is quick
  Then every 5% improvement is about a year’s work.
  Even when we get to 90% for individual entity extraction, for recognizing 4 related entities: (0.9)^4 ≈ 0.66
Lessons from biomedical text extraction
  No proper evaluation.
  Recognized that recall and precision are not very good even in the “best” systems.
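The compounding effect in the entity-extraction bullet is easy to check. The short Python sketch below assumes that the four entities are recognized independently, each with the same per-entity accuracy. The slide does not state that assumption, and the function name `joint_accuracy` is hypothetical.

```python
def joint_accuracy(per_entity_accuracy, n_entities):
    """Probability that all n entities are right, assuming independence."""
    return per_entity_accuracy ** n_entities


# Four related entities, each recognized correctly 90% of the time.
print(round(joint_accuracy(0.9, 4), 4))
```

Even at 90% per entity, roughly one in three four-entity facts would contain at least one error.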

Page 63: Chitta Baral  Arizona State University

What do we do?

How do we curate not only the existing articles, but also the future articles?

Too important to give up!
Need to think of a new way to do it.
Faster computers, better sequencing technology and better algorithms came to the rescue of the Human Genome project.

Hmm. What resources are we overlooking?

Page 64: Chitta Baral  Arizona State University

Key Idea

If lots of articles are being written then a lot of people are writing them and a lot of people are reading them.

If only we could make these people (the authors and the readers) contribute to the curation effort …

Especially the readers; the ones who need the curated data!

Page 65: Chitta Baral  Arizona State University

Mass collaboration has worked in
  Wikipedia
  Project Gutenberg
  Netflix rating
  Amazon rating
  Etc.

Page 66: Chitta Baral  Arizona State University

Mass collaborative curation: initial hurdles

An average reader
  (S)he is not normally interested in filling a blank curation form.
  We cannot make an average reader go through curation training.
So it has to be very different from just making the existing curation tools available to the masses and expecting them to contribute.

Page 67: Chitta Baral  Arizona State University

Mass collaborative curation: key initial ideas

Make it very easy:
  The user need not remember where (which database, which web page) to put the curated knowledge.
  Curation opportunity should present itself seamlessly.
Curation should not be a burden to an average user
  Make the curated knowledge “thin”.
There should be immediate rewards
Do not start with a blank slate.

Page 68: Chitta Baral  Arizona State University

Realization of the key ideas: a biologist with a gene name

Goes to PubMed, types the gene name, clicks on one of the abstracts
  Curation panel presents itself automatically
  Our approach calls for researchers to contribute to the curation of facts as they read and research over the web
But not with a blank slate
  No one wants to be the first one!
  Automatic extraction jump-starts the process, and then researchers improve upon the extracted data, “ironing out” inconsistencies by subsequent edits on a massive scale.
Thin schemas
  Average users are turned off by traditional wide schemas
  Wide schemas need to be broken down.

Page 79: Chitta Baral  Arizona State University

Summary

Information/curation window pops up automatically.
Automatic extraction is used as a bootstrap so that no user is working on a blank slate.
Users vote on correctness, make corrections, add facts.
Suppose 60% precision and recall of the automatic extraction system
  A person will have an easier time discarding the 40% of wrongly extracted text than identifying the 60% of correct entries and entering them!
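The workload argument can be put in numbers. The Python sketch below only applies the standard definitions of precision and recall to a made-up corpus of 100 true facts. The figure of 100 and the function name `curation_workload` are assumptions for illustration, not data from the talk.

```python
def curation_workload(true_facts, precision, recall):
    """Split an extractor's output into what users must discard, keep or add."""
    correct_extracted = recall * true_facts            # true facts the extractor found
    total_extracted = correct_extracted / precision    # everything it proposed
    return {
        "correct_to_confirm": correct_extracted,
        "wrong_to_discard": total_extracted - correct_extracted,
        "missed_to_add": true_facts - correct_extracted,
    }


workload = curation_workload(true_facts=100, precision=0.6, recall=0.6)
for task, count in workload.items():
    print(task, round(count))
```

With these numbers a reader confirms about 60 pre-filled facts and discards about 40 wrong ones by voting, instead of typing in the 60 correct facts from scratch. The roughly 40 missed facts still have to be added by hand either way.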

Page 80: Chitta Baral  Arizona State University

Very useful byproducts

Avoids some problems with the existing human curation approach
  Curators’ bias
  Curators miss things
  Curators have disagreements
  Slow access to newest findings
  Researchers at large have little or no control over what gets curated and when
A large curated corpus of text gets created
  Very useful to evaluate and improve automated extraction systems.

Page 81: Chitta Baral  Arizona State University

Current status of CBioC; future plans

Basic system, as described, is ready
Being populated with
  Facts from existing databases (BIND etc.)
  Facts extracted using our extraction system
Querying mechanism
Answer display
Future work
  Voter confidence issues
  …

Page 82: Chitta Baral  Arizona State University

Conclusion

Collecting what is known
Reasoning with what is known
Hypothesizing what is unknown (based on observations)

Page 83: Chitta Baral  Arizona State University

Open Invitation

We are building, and eager to help other groups build, knowledge bases in particular domains to
  Predict impact of interventions
  Plan (therapy design) to make a pathway behave in a desired way
  Explain observations
  Hypothesize new knowledge
Further improvements to and adaptation of CBioC

Page 84: Chitta Baral  Arizona State University

Acknowledgements

BioSigNet
  Nam Tran, Ph.D thesis on this, Postdoc @ Yale
  Karen Chancellor, Ph.D student
  Michael Berens and his group (Ana Joy, Nhan Tran)
  Lokesh Joshi and his group (Vinay Nagraj)
CBioC: Graciela Gonzalez, Lian Yu, Luis Tari, Tony Gitter, Amanda Ziegler, Ryan Wendt, Prabhdeep Singh.
Other projects:
  BioQA
  Biogenenet

Page 85: Chitta Baral  Arizona State University

Thank you!