A knowledge-based approach for representing, reasoning and hypothesizing about biochemical networks
Chitta Baral, Arizona State University

Three parts to the talk

Prediction, Explanation and Planning with respect to biochemical networks

Hypothesis Generation with respect to biochemical networks

Collaborative BioCuration: CBioC

Motivation: purpose of interaction databases?

Suppose we have an almost exhaustive database of the intracellular interactions (protein-protein, metabolic, etc.) of particular cells.
What next?
How will we use this database?
What if our knowledge is incomplete?

Motivation: Uses of networks & pathways

Visualize the pathways
Analyze the graphs of the networks
Compare graphs of the networks
Use pathway data in conjunction with microarray data analysis
Do system-level simulation
Is that all?

Motivation: ultimate uses!

Prediction/System Simulation (Systems Biology?)
  Impact of particular perturbations (say, caused by a drug that introduces certain proteins to the cell membrane or into the cell)
  Do the perturbations have the desired impact?
  Do they mess up something else? (side effects!)

But that’s not all!

Motivation: Explaining observations

A phenotypical observation (leading to) OR an observation that a particular protein or chemical has abnormally high concentration
What is wrong? What is out of the ordinary?
The cause/explanation will give us approaches to fix the problem.
How deep should the explanations go?
How do we compare explanations?

Motivation: Designing drugs & therapies

What perturbations (when and where) need to be made so as to make the cell behave in a particular way?
In case of cancer: prevent proliferation, induce apoptosis, prevent migration, etc.
What if knowledge is incomplete?
What kind of useful reasoning can we do with incomplete knowledge?
Drug makers don’t wait till full knowledge is available.
Answer: hypothesis formation

Motivation: Use summary

The ultimate uses of signaling (metabolic, etc.) interaction databases are:
  Prediction: therapy verification; determining side effects.
  Explanation: diagnosing what is wrong.
  Planning: therapy and drug design.
Intermediate or immediate use:
  Generate hypotheses

Initial goal of our research

Use knowledge representation and reasoning techniques to:
  Represent interactions
  Reason about these interactions: prediction, explanation, planning and hypothesis formation.

Some questions

Isn’t it a little premature?
  We know very little about the networks
  New knowledge is being constantly added
Why knowledge representation and reasoning?
  Why not simulation?
  Why not use Petri nets or the π-calculus?
Why a knowledge-based approach?
  Why not a database approach?
  What’s the difference?

Our approach: present and future

Yes, prediction is kind of the same as simulation.
  Incompleteness of information is an issue though!
But it is hard to do explanation generation or design of therapies (planning) using simulation; guesses can be verified using simulation, though.
The core database query languages cannot express explanation or planning queries.
  Dealing with incompleteness!

Dealing with incompleteness: ongoing and future work

Dealing with incompleteness is one of the key criteria behind a “good” knowledge representation language when building AI systems.
  The language needs to be non-monotonic.
  The language needs to be elaboration tolerant.
Proper analysis leads to hypothesizing
  If certain observations cannot be satisfactorily explained by the existing knowledge about the network, then use general biological knowledge to hypothesize.

Motivation -- summary

Goal: To emulate the abstract reasoning done by biologists, medical researchers, and pharmacology researchers.

Types of reasoning: prediction, explanation, planning and hypothesis formation.

Current systems biology approaches: mostly prediction.

Ongoing issues: Dealing with incomplete knowledge and elaboration tolerance.

Related Works

Quantitative approaches (hybrid systems, use of differential equations)
Graphical representations
Other qualitative approaches
  Petri nets
  π-calculus
  Pathway Logic
  Model checking

Overview of our approach

Represent the signal network as a knowledge base that describes
  actions/events (biological interactions, processes),
  effects of these actions/events,
  triggering conditions of the actions/events.
Query using the knowledge base: prediction; explanation; planning; hypothesis generation.
BioSigNet-RR (Biological Signal Network - Representation and Reasoning) and BioSigNet-RRH systems.

Foundation behind our approach

Research on representing and reasoning about dynamic systems (space shuttles, mobile robots, software agents)
  causal relations between properties of the world
  effects of actions (when can they be executed)
  goal specification
  action-plans
Research on knowledge representation, reasoning and declarative problem solving: the AnsProlog language.

An NFκB signaling pathway

Syntax by example

bind(TNF-α,TNFR1) causes trimerized(TNFR1)

trimerized(TNFR1) triggers bind(TNFR1,TRADD)

General syntax to represent networks

e causes f if f1; …; fk
g1; …; gk causes g
h1; …; hm n_triggers e
k1; …; kl triggers e
r1; …; rl inhibits e

Here e is an event (also referred to as an action) and the rest are fluents (properties of the cell).
For metabolic interactions: e converts g1; …; gk to f1; …; fk if h1; …; hm
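One possible way to hold such statements in a program is sketched below. This is only an illustrative data structure, not the BioSigNet-RR implementation (which the talk says is written in AnsProlog); the class names, the helper `triggered_events`, and the ASCII spelling `TNF-alpha` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Causes:
    """'event causes effect if conditions' (hypothetical encoding)."""
    event: str
    effect: str
    conditions: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class Triggers:
    """'conditions triggers event'."""
    conditions: FrozenSet[str]
    event: str


@dataclass(frozen=True)
class Inhibits:
    """'conditions inhibits event'."""
    conditions: FrozenSet[str]
    event: str


# The two statements from the "Syntax by example" slide.
network = [
    Causes("bind(TNF-alpha,TNFR1)", "trimerized(TNFR1)"),
    Triggers(frozenset({"trimerized(TNFR1)"}), "bind(TNFR1,TRADD)"),
]


def triggered_events(rules, state):
    """Events whose trigger conditions all hold in `state` and that
    no applicable inhibits-statement blocks (assumed reading)."""
    blocked = {r.event for r in rules
               if isinstance(r, Inhibits) and r.conditions <= state}
    fired = {r.event for r in rules
             if isinstance(r, Triggers) and r.conditions <= state}
    return fired - blocked


print(triggered_events(network, {"trimerized(TNFR1)"}))
```

Keeping each statement type as its own immutable record makes it easy to add or retract single statements, which is the kind of elaboration tolerance the talk asks for.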

Semantics: queries and entailment

Observation part of queries
  f at t
  a occurs_at t
Given the network N and observation O
  Predict if a temporal expression holds.
  Explain a set of observations.
  Plan to achieve a goal.

Importance of a formal semantics

Besides defining prediction, explanation and planning, it is also useful in identifying:
  Under what restrictions the answer given by a given (graph-based) algorithm will be correct. (soundness!)
  Under what restrictions a given (graph-based) algorithm will find a correct answer if one exists. (completeness!)

Utility of declarative programming languages (such as AnsProlog)

Allows for quick implementation of the semantics
The specification or definition of what an explanation or a plan is becomes a program that finds explanations and plans, respectively.

Prediction

Given some initial conditions and observations, to predict how the world would evolve or predict the outcome of (hypothetical) interventions.

Back to the example

Binding of TNF-α with TNFR1 leads to TRADD binding with one or more of TRAF2, FADD, RIP.
TRADD binding with TRAF2 leads to over-expression of FLIP, provided NIK is phosphorylated on the way.
TRADD binding with RIP inhibits phosphorylation of NIK.
TRADD binding with FADD in the absence of FLIP leads to cell death.

Prediction 1

Initial condition: bind(TNF-α,TNF-R1) occurs at t0
Query: predict eventually apoptosis
Answer: Unknown!
  Incomplete knowledge about TRADD’s bindings.
  Depends on whether bind(TRADD, RIP) happened or not!

Prediction 2

Initial condition: bind(TNF-α,TNF-R1) occurs at t0
Observation: TRADD’s binding with TRAF2, FADD, RIP
Query: predict eventually apoptosis
Answer: Yes!
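The two prediction queries can be mimicked with a small sketch that enumerates every way TRADD might bind when nothing is observed. This is a static simplification under stated assumptions, not the temporal AnsProlog semantics used by the system: it assumes NIK is phosphorylated unless TRADD binds RIP, and that FLIP is over-expressed exactly when TRADD binds TRAF2 and NIK is phosphorylated. The function names are hypothetical.

```python
from itertools import combinations

PARTNERS = ("TRAF2", "FADD", "RIP")


def apoptosis(bound):
    """Outcome for one fully specified set of TRADD binding partners."""
    # Assumption: NIK gets phosphorylated unless the RIP binding inhibits it.
    nik_phosphorylated = "RIP" not in bound
    flip = "TRAF2" in bound and nik_phosphorylated
    # FADD binding in the absence of FLIP leads to cell death.
    return "FADD" in bound and not flip


def predict_apoptosis(observed=None):
    """Answer 'yes', 'no' or 'unknown' for 'predict eventually apoptosis'."""
    if observed is not None:
        worlds = [frozenset(observed)]
    else:
        # TRADD binds one or more of the partners; which ones is unknown.
        worlds = [frozenset(c)
                  for n in range(1, len(PARTNERS) + 1)
                  for c in combinations(PARTNERS, n)]
    outcomes = {apoptosis(w) for w in worlds}
    if outcomes == {True}:
        return "yes"
    if outcomes == {False}:
        return "no"
    return "unknown"


print(predict_apoptosis())                          # unknown
print(predict_apoptosis({"TRAF2", "FADD", "RIP"}))  # yes
```

An answer is only committed to when it holds in every world consistent with the observations, which is why adding the observation about TRADD's bindings turns "unknown" into "yes".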

Explanation

Given initial condition and observations, to explain why final outcome does not match expectation.

Explanation 1

Initial condition: bound(TNF-α,TNFR1) at t0
Observation: bound(TRADD, TRAF2) at t1
Query: explain apoptosis
One explanation:
  Binding of TRADD with RIP
  Binding of TRADD with FADD

Planning

Given initial conditions, to plan interventions to achieve a goal.

Application in drug and therapy design.

Planning requirements

In addition to the knowledge about the pathway, we need information about possible interventions, such as:
  What proteins can be introduced.
  What mutations can be forced.

Planning example

Defining possible interventions:
  intervention intro(DN-TRAF2)
  intro(DN-TRAF2) causes present(DN-TRAF2)
  present(DN-TRAF2) inhibits bind(TRAF2,TRADD)
  present(DN-TRAF2) inhibits interact(TRAF2,NIK)
Initial condition:
  bound(NFκB,IκB) at 0
  bind(TNF-α,TNF-R1) at 0
Goal: to keep NFκB inactive.
Query: plan always bound(NFκB,IκB) from 0
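A planning query of this shape can be read as a search over sets of interventions, keeping the smallest set under which the goal fluent holds at every step. The sketch below is a toy under loud assumptions: the slides do not list the pathway between TRADD and NFκB, so the statements marked "assumed" (including a shortcut from bind(TRAF2,TRADD) directly to loss of bound(NFkB,IkB)) are hypothetical stand-ins, and the fixed-horizon simulation is a simplification rather than the system's semantics.

```python
from itertools import combinations

# event -> effects; a leading "-" means the fluent becomes false.
CAUSES = {
    "bind(TNF-a,TNFR1)": ["trimerized(TNFR1)"],    # from the slides
    "bind(TNFR1,TRADD)": ["bound(TNFR1,TRADD)"],   # assumed
    "bind(TRAF2,TRADD)": ["-bound(NFkB,IkB)"],     # assumed shortcut
    "intro(DN-TRAF2)": ["present(DN-TRAF2)"],      # from the slides
}
TRIGGERS = [
    ({"trimerized(TNFR1)"}, "bind(TNFR1,TRADD)"),   # from the slides
    ({"bound(TNFR1,TRADD)"}, "bind(TRAF2,TRADD)"),  # assumed
]
INHIBITS = [({"present(DN-TRAF2)"}, "bind(TRAF2,TRADD)")]  # from the slides
INTERVENTIONS = ["intro(DN-TRAF2)"]


def goal_always_holds(interventions, goal="bound(NFkB,IkB)", horizon=6):
    """Simulate `horizon` steps and check the goal after every step."""
    state = {"bound(NFkB,IkB)"}
    events = {"bind(TNF-a,TNFR1)", *interventions}  # all occur at time 0
    for _ in range(horizon):
        for event in events:
            for effect in CAUSES.get(event, []):
                if effect.startswith("-"):
                    state.discard(effect[1:])
                else:
                    state.add(effect)
        if goal not in state:
            return False
        blocked = {e for cond, e in INHIBITS if cond <= state}
        events = {e for cond, e in TRIGGERS if cond <= state} - blocked
    return True


def plan():
    """Smallest intervention set that keeps the goal true, or None."""
    for size in range(len(INTERVENTIONS) + 1):
        for combo in combinations(INTERVENTIONS, size):
            if goal_always_holds(combo):
                return list(combo)
    return None


print(plan())  # ['intro(DN-TRAF2)']
```

Trying smaller intervention sets first means the empty plan is returned whenever no intervention is needed at all.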

Conclusion of part 1

From the paper in ISMB 2004:
  Our goal in this paper was to make progress towards developing a system (and the necessary representation language and reasoning algorithms) that can be used to represent signal networks and pathways associated with cells and reason with them.
A start was made.
  Defined a simple language (syntax and semantics)
  Defined prediction, planning and explanation
  A prototype implementation using AnsProlog
  Illustration of its applicability with respect to an NFκB pathway.

Issues with incomplete knowledge

Often one may not be able to do much prediction, explanation or planning.
What then?
Can reasoning help in obtaining new knowledge?
Yes, through hypothesis generation!
In fact, hypothesis generation needs reasoning!

Part II: Hypothesis Generation

Hypothesis generation

Our observations cannot be explained by our existing knowledge, OR the explanations given by our existing knowledge are invalidated by experiments.
Conclusion: our knowledge needs to be augmented or revised. How?
Can we use a reasoning system to predict some hypotheses that one can verify through experimentation?
Automate the reasoning in the mind of a biologist; this is especially helpful when the background knowledge is humongous.

Hypothesis space

[Figure: the hypothesis space surrounding the knowledge base. Labels recovered from the diagram: High UV; UV leads_to cancer; p53; Cancer; No cancer; (K,I) |= O]

Issues in this tiny example

Hypothesis formation:
  Theory: UV leads to cancer.
  Observation: wild-type p53 resists the UV effect.
  Hypothesis: p53 is a tumor suppressor.

Elaboration tolerance: How do we update/revise “UV leads to cancer”?

Default & non-monotonic (NM) reasoning:
  Normally UV leads to cancer.
  UV does not lead to cancer if p53 is present.

Related Works: some prior mention of hypothesis formation

HYPGENE (Karp, 1991)
TRANSGENE (Darden, 1997)
GenePath (Zupan et al., 2003)
Robot Scientist (King et al., 2004)
Database (Doherty et al., 2004)
BIOCHAM (Calzone et al., 2005)
PathLogic (Karp et al., 2002)
Cytoscape (Shannon et al., 2003)
Integrative Scheme (Su et al., 2003)
Pathway Analysis (Ingenuity)

… but these do not use the latest advances in knowledge representation and reasoning (e.g. they lack ways to express defaults, non-monotonicity, elaboration tolerance, problem solving rules, etc.).

Hypothesis formation

Knowledge base: K
Set of initial conditions: I
Set of (experimental) observations: O
(K,I) does not entail O
To expand (K,I) to (K’, I’): (K’, I’) entails O
How to expand (hypothesis space):
  Explanation: expand only I
  Diagnosis: normality assumptions about I; minimally abandon the normality assumptions
  Hypothesis formation: expand K

Construction of hypothesis space

Present: manual construction, using research literature
Future: integration of multiple data sources
  Protein interactions
  Pathway databases
  Biological ontologies
  …
Provide cues, hunches such as
  A may interact with B: action interact(A,B)
  A-B interaction may have effect C: interact(A,B) causes C

Generation of hypotheses

Enumeration of hypotheses
Search: computing with Smodels (an implementation of AnsProlog)
Heuristics
  A trigger statement is selected only if it is the only cause of some action occurrence that is needed to explain the novel observations.
  An inhibition statement is selected only if it is the only blocker of some triggered action at some time.
  Maximizing preferences of selected statements

Generation … (cont’): heuristics

Knowledge base K
  a causes g
  b causes g
Initial condition I = { initially f }
Observation O = { eventually g }
(K,I) does not entail O
Hypothesis space: to expand K with rules among
  f triggers a
  f triggers b
Hypotheses: { f triggers a }, or { f triggers b }
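The tiny example above can be reproduced by brute-force enumeration: try subsets of the candidate trigger statements, smallest first, and keep those that make g eventually hold. This is a sketch only. The talk's system searches with Smodels and applies the heuristics listed earlier, whereas this code swaps in plain subset-minimality, and the names `eventually` and `minimal_hypotheses` are hypothetical.

```python
from itertools import combinations

# Knowledge base K: event -> fluent it causes.
CAUSES = {"a": "g", "b": "g"}
# Hypothesis space: (fluent, event) meaning "fluent triggers event".
CANDIDATES = [("f", "a"), ("f", "b")]


def eventually(goal, initial, triggers, horizon=5):
    """True if `goal` becomes true within `horizon` steps."""
    state = set(initial)
    for _ in range(horizon):
        if goal in state:
            return True
        fired = {event for fluent, event in triggers if fluent in state}
        state |= {CAUSES[event] for event in fired if event in CAUSES}
    return goal in state


def minimal_hypotheses(initial, goal):
    """Subset-minimal sets of candidate statements that entail the goal."""
    found = []
    for size in range(len(CANDIDATES) + 1):
        for hypothesis in combinations(CANDIDATES, size):
            # Skip supersets of an already accepted hypothesis.
            if any(set(prev) <= set(hypothesis) for prev in found):
                continue
            if eventually(goal, initial, hypothesis):
                found.append(hypothesis)
    return found


for hypothesis in minimal_hypotheses({"f"}, "g"):
    print(["%s triggers %s" % pair for pair in hypothesis])
```

With I = { f } and O = { eventually g } this yields the two single-statement hypotheses from the slide; the combined set is skipped because it is not minimal.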

Case study: p53 network

Tumor suppression by p53

p53 has 3 main functional domains
  N-terminal transactivator domain
  Central DNA-binding domain
  C-terminal domain that recognizes DNA damage
Appropriate binding of the N terminal activates pathways that lead to protection of the cell from cancer.
Inappropriate binding (say to Mdm2) inhibits p53-induced tumor suppression.

p53 knowledge base

Stress
  high(UV) triggers upregulate(mRNA(p53))
Upregulation of p53
  upregulate(mRNA(p53)) causes high(mRNA(p53))
  high(mRNA(p53)) triggers translate(p53)
  translate(p53) causes high(p53)

p53 knowledge base (cont.)

Tumor suppression by p53
  high(p53) inhibits growth(tumor)

p53 knowledge base (cont’)

Interaction between Mdm2 and p53
  high(p53), high(mdm2) triggers bind(p53,mdm2)
  bind(p53,mdm2) causes bound(dom(p53,N))
  bind(p53,mdm2) causes high([p53 : mdm2])
  bind(p53,mdm2) causes ¬high(p53), ¬high(mdm2)

Hypothesis formation

Experimental observation:
  I = { initially high(UV), high(mdm2), high(ARF) }
  O = { eventually ~ tumorous }
(K,I) does not entail O
Need to hypothesize the role of ARF.

Constructing hypothesis space

Levels of ARF and p53 correlate
  high(ARF) triggers upregulate(mRNA(p53))
  high(p53) triggers upregulate(mRNA(ARF))
Interactions of ARF with the known proteins
  bind(p53,ARF) causes bound(dom(p53,N))

Constructing …(cont’)

Influence of X (= ARF) on other interactions
  high(ARF) triggers upreg(mRNA(p53))
  high(ARF) triggers translate(p53)
  high(ARF) triggers bind(p53,mdm2)

Constructing …(cont’)

Twelve generated hypotheses, such as
  high(UV) triggers upregulate(mRNA(ARF))
  high(ARF), high(mdm2) triggers bind(ARF,mdm2)

Conclusion of part 2

Goal: automation of hypothesis formation (with respect to interactions and pathways)
Approach:
  Viewed known qualitative aspects of cell activities as a knowledge base
  Used a knowledge representation language that
    Can express defaults
    Allows reasoning with incomplete knowledge
    Can express reasoning as well as problem solving rules
Developed a system, BioSigNet-RRH:
  Formalizing and reasoning about hypotheses
Illustration: hypothesizing the role of the ARF protein in the p53 network.

Future Work on Reasoning about Biochemical Networks (Part I and II)

Further development of the language
Validation with respect to larger networks
  Kohn’s map
  Networks in Reactome and other repositories
Going from prototype to deployable systems
  Scaling up challenges
  Recent advances in automatic planning
Integration with BioPAX

Part III: CBioC

http://cbioc.org

Do we have enough knowledge in the various databases?

Some has been curated into databases.
But there is much more in the literature.
So what do we do?

Current status of curation from text

About 15 million abstracts in PubMed
3 million published by US and EU researchers during 1994-2004 (800 articles per day)
300K articles published so far reporting protein-protein interactions in human, yeast and mouse.
BIND (in 7 yrs): 23K; DIP: 3K; MINT: 2.4K.

Premise: High cost of human curation

Overwhelming cost of large curation efforts may be unsustainable for long periods
BIND: Nov 2005 bad news.
  Operated for 7 years
  Listed over 100 curators & programmers
  CND $29 million received in 2003, plus other funding
Curation efforts of AFCS have recently stopped.
Lack of funding for some genome annotation projects.

Premise: summary

Human curation of text is expensive. Human curation of text is not scalable. Human curation of text is not sustainable.

Why not resort to computers and do automatic extraction?

Lessons from DARPA-funded MUCs (Message Understanding Conferences) in the 90s, run for a decade at the cost of tens of millions of dollars:
  Getting to 60% recall and precision is quick.
  Then every 5% improvement is about a year’s work.
  Even when we get to 90% for an individual entity extraction, for recognizing 4 related entities: (0.9)^4 ≈ 0.66
Lessons from biomedical text extraction:
  No proper evaluation.
  Recognized that recall and precision are not very good even in the “best” systems.
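The compounding argument (90% accuracy per entity, four related entities) is a one-line calculation, shown here as a sketch. It assumes the per-entity extraction errors are independent, which the slide implies but does not state; the function name is hypothetical.

```python
def joint_accuracy(per_entity_accuracy, n_entities):
    """Chance that all n entities are extracted correctly,
    assuming independent errors (an assumption of this sketch)."""
    return per_entity_accuracy ** n_entities


print(round(joint_accuracy(0.9, 4), 4))  # 0.6561
```

So even a strong per-entity extractor gets a four-entity fact fully right only about two times in three.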

What do we do?

How do we curate not only the existing articles, but also the future articles?

Too important to give up!
Need to think of a new way to do it.
Faster computers, better sequencing technology and better algorithms came to the rescue of the Human Genome Project.
Hmm. What resources are we overlooking?

Key Idea

If lots of articles are being written, then a lot of people are writing them and a lot of people are reading them.

If only we could make these people (the authors and the readers) contribute to the curation effort …

Especially the readers; the ones who need the curated data!

Mass collaboration has worked in
  Wikipedia
  Project Gutenberg
  Netflix rating
  Amazon rating
  Etc.

Mass collaborative curation: initial hurdles

An average reader
  (S)he is not normally interested in filling in a blank curation form.
  We cannot make an average reader go through curation training.
So it has to be very different from just making the existing curation tools available to the masses and expecting them to contribute.

Mass collaborative curation: key initial ideas

Make it very easy:
  The user need not remember where (which database, which web page) to put the curated knowledge.
  The curation opportunity should present itself seamlessly.
Curation should not be a burden to an average user
  Make the curated knowledge “thin”.
There should be immediate rewards
Do not start with a blank slate.

Realization of the key ideas: a biologist with a gene name

Goes to PubMed, types the gene name, clicks on one of the abstracts
Curation panel presents itself automatically

Our approach calls for researchers to contribute to the curation of facts as they read and research over the web

But not with a blank slate
  No one wants to be the first one!
  Automatic extraction jump-starts the process, and then researchers improve upon the extracted data, “ironing out” inconsistencies by subsequent edits on a massive scale.
Thin schemas
  Average users are turned off by traditional wide schemas
  Wide schemas need to be broken down.

Summary

The information/curation window pops up automatically.
Automatic extraction is used as a bootstrap so that no user is working on a blank slate.
Users vote on correctness, make corrections, add facts.
Suppose 60% precision and recall of the automatic extraction system:
  A person will have an easier time discarding the 40% of wrongly extracted text than identifying the 60% of correct entries and entering them!

Very useful byproducts

Avoids some problems with the existing human curation approach
  Curators’ bias
  Curators miss things
  Curators have disagreements
  Slow access to newest findings
  Researchers at large have little or no control over what gets curated and when
A large curated corpus of text gets created
  Very useful to evaluate and improve automated extraction systems.

Current status of CBioC; future plans

Basic system, as described, is ready
Being populated with
  Facts from existing databases (BIND etc.)
  Facts extracted using our extraction system
Querying mechanism
Answer display
Future work
  Voter confidence issues
  …

Conclusion

Collecting what is known
Reasoning with what is known
Hypothesizing what is unknown (based on observations)

Open Invitation

We are building, and eager to help other groups build, knowledge bases in particular domains to
  Predict impact of interventions
  Plan (therapy design) to make a pathway behave in a desired way
  Explain observations
  Hypothesize new knowledge
Further improvements to and adaptation of CBioC

Acknowledgements

BioSigNet
  Nam Tran (Ph.D. thesis on this; postdoc at Yale)
  Karen Chancellor, Ph.D. student
  Michael Berens and his group (Ana Joy, Nhan Tran)
  Lokesh Joshi and his group (Vinay Nagraj)
CBioC: Graciela Gonzalez, Lian Yu, Luis Tari, Tony Gitter, Amanda Ziegler, Ryan Wendt, Prabhdeep Singh.
Other projects:
  BioQA
  Biogenenet

Thank you!
