bioinformatics, data integration and machine learning a thesis proposal

Apr 19, 2023 1

Bioinformatics, Data Integration and Machine Learning: A Thesis Proposal

Kaushik Sinha

Supervisors: Prof. Gagan Agrawal and Prof. Mikhail Belkin

Apr 19, 2023 2

Roadmap
  Motivation
  Our Approach
  Current Work
    Learning Layouts of Flat-file Biological Datasets
    Exploratory Tools for Biological Data Analysis
  Proposed Work
    Deep Web Mining for Biological Data
    Semi-supervised Ranking
    Multiple-instance Learning
  Conclusion

Apr 19, 2023 3

Motivation
Integration is hard
  Data explosion: data size and number of data sources
  New analysis tools
  Autonomous resources
  Heterogeneous data representation and various interfaces
  Frequent updates
  New trend: web and grid services

Apr 19, 2023 4

Motivation (contd.)
In recent years, DNA microarray and other gene and protein assays have become essential tools for biologists
The next step of biological enquiry is to find out
  What is known about these genes?
  How are these genes related to each other, or to other genes identified in similar studies?
However, the major difficulties are
  How do we extract key properties shared by a set of candidate genes?
  How do we generate reasonable hypotheses to explain them?
  How do we define and evaluate similarity between sets of genes?

Apr 19, 2023 5

Motivating Example
Suppose that after a microarray experiment a biologist suspects that a small set of genes is related to a disease
  This can be confirmed by searching the existing literature
  One would expect related genes to appear together in the literature
Due to the sheer volume of literature
  Searching is time consuming and error prone
  Some complications could arise as well
However, suppose genes A and C are related, and both of them are weakly related to gene B
  In the literature, one would expect A and C to appear together, and/or A and B to appear together and B and C to appear together
  How do we efficiently conclude that A and C are actually related?

Apr 19, 2023 6

Our Approach
Use data mining / machine learning techniques to extract useful information from biological data
Different forms of data
  Flat-file data
  Microarray data
  Online literature abstracts
Develop different forms of tools
  Layout extractor
  Hypergraph mining
  Similarity measure among sets of genes

Apr 19, 2023 7





Conclusion

Apr 19, 2023 8

Learning Layout of a Flat-File
In general: intractable
Try to learn the layout, and have a domain expert verify it
Key issue: what delimiters are being used?

Apr 19, 2023 9

Finding Delimiters
Some knowledge from a domain expert is required (semi-automatic)
Naïve approaches
  Frequency counting: counts frequently occurring single tokens (words separated by spaces)
  Sequence mining: counts frequently occurring sequences of tokens
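The two naïve approaches can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and contiguous token sequences within a line; the function names and the sample records are hypothetical and do not come from the datasets used in this work.

```python
from collections import Counter


def token_frequencies(lines):
    """Frequency counting: count each whitespace-separated token."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def sequence_frequencies(lines, length):
    """Sequence mining (simplified): count contiguous token sequences
    of the given length within each line."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for start in range(len(tokens) - length + 1):
            counts[tuple(tokens[start:start + length])] += 1
    return counts


# Hypothetical flat-file records, for illustration only.
records = [
    "ID   P12345",
    "DE   Example protein",
    "ID   Q67890",
    "DE   Another protein",
]

single = token_frequencies(records)
pairs = sequence_frequencies(records, 2)
```

In this toy input the ordinary word "protein" is as frequent as the field tag "ID", which hints at why frequency alone tends to produce a loose candidate set.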

Apr 19, 2023 10

Assumptions
Biological datasets are written for humans to read
  It is very unlikely that delimiters will be scattered all around, in different places in a line
Position of the possible delimiters might provide useful information
Combination of positional and frequency information might be a better choice

Apr 19, 2023 11

Positional Weight

Let P be the set of positions in a line where a token can appear

For each position i ∈ P, tot_seq_i^j is the total number of token sequences of length j starting at position i

For each position i ∈ P, tot_unique_seq_i^j is the total number of unique token sequences of length j starting at position i

For any tuple (i,j), p_ratio(i,j) is defined as

\[ p\_ratio(i,j) = \frac{tot\_seq_i^j}{tot\_unique\_seq_i^j} \]

p_ratio(i,j) can be log normalized to get the positional weight p_wt(i,j), with the property p_wt(i,j) ∈ (0,1)

Apr 19, 2023 12

Delimiter score (d_score)
The frequency weight f_wt(s_i^j) of any token sequence s_i^j, with length j and starting at position i, is obtained by log normalizing its frequency f(s_i^j)
  By construction, f_wt(s_i^j) ∈ (0,1)
The positional and frequency weights can now be combined to get the d_score as follows:

\[ d\_score(s_i^j) = \alpha \cdot p\_wt(i,j) + (1-\alpha) \cdot f\_wt(s_i^j), \quad \alpha \in (0,1) \]

Thus d_score has the following two properties:
  d_score(s_i^j) ∈ (0,1)
  d_score(s_i^j) > d_score(s_k^j) implies s_i^j is more likely to be a delimiter than s_k^j
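A rough sketch of how the positional weight, the frequency weight, and the d_score fit together is given here. Several details are assumptions rather than the method as implemented: positions are taken to be 0-based token indices, the positional ratio is read as total sequences over unique sequences at a position, and the exact form of "log normalization" is not specified in the slides, so `log_normalize` below is just one choice that lands in (0,1). All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def log_normalize(value, upper):
    """Map a value in (0, upper] into (0, 1).
    Assumption: the slides do not give the exact normalization."""
    return math.log(1 + value) / math.log(2 + upper)


def d_scores(lines, max_len=2, alpha=0.5):
    """Return {(i, sequence): d_score} for token sequences starting at
    token position i. Expects at least one non-empty line."""
    counts = defaultdict(Counter)  # (i, j) -> Counter of sequences
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens)):
            for j in range(1, max_len + 1):
                if i + j <= len(tokens):
                    counts[(i, j)][tuple(tokens[i:i + j])] += 1

    # p_ratio(i, j) = total sequences / unique sequences at position i
    p_ratio = {key: sum(c.values()) / len(c) for key, c in counts.items()}
    max_ratio = max(p_ratio.values())
    max_freq = max(f for c in counts.values() for f in c.values())

    scores = {}
    for (i, j), c in counts.items():
        p_wt = log_normalize(p_ratio[(i, j)], max_ratio)
        for seq, freq in c.items():
            f_wt = log_normalize(freq, max_freq)
            scores[(i, seq)] = alpha * p_wt + (1 - alpha) * f_wt
    return scores


# Hypothetical records: "ID" and "DE" repeat at position 0.
scores = d_scores(["ID P1", "DE alpha", "ID P2", "DE beta"])
```

With this input the tag `("ID",)` at position 0 scores higher than the value `("P1",)` at position 1, because both its position and its frequency show more repetition.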

Apr 19, 2023 13

Generating layout descriptor

Once the delimiters are identified, an NFA can be built by scanning the whole database, where the delimiters are the states of the NFA

This NFA can be used to generate a layout descriptor, since it naturally represents optional and repeating states

The following figure shows an NFA where A, B, C, D, and E are delimiters, with B being an optional delimiter and C D being a repeating delimiter sequence
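To make the NFA idea concrete, the sketch below collects delimiter-to-delimiter transitions from records that have already been reduced to their delimiter sequences. It is an illustration under simplifying assumptions, not the layout-descriptor generator itself: "optional" is approximated as "absent from at least one record" and "repeating" as "occurring more than once in some record". The two sample records are hypothetical and mirror the A, B, C, D, E example.

```python
from collections import defaultdict


def build_transitions(records):
    """Collect observed delimiter-to-delimiter transitions across records."""
    transitions = defaultdict(set)
    for record in records:
        for current, following in zip(record, record[1:]):
            transitions[current].add(following)
    return dict(transitions)


def optional_delimiters(records):
    """Delimiters that are missing from at least one record."""
    seen = set().union(*records)
    return {d for d in seen if any(d not in record for record in records)}


def repeating_delimiters(records):
    """Delimiters that occur more than once within some record."""
    return {d for record in records for d in record if record.count(d) > 1}


# Each record is the ordered list of delimiters found in it.
records = [
    ["A", "B", "C", "D", "C", "D", "E"],
    ["A", "C", "D", "E"],
]

transitions = build_transitions(records)
optional = optional_delimiters(records)
repeating = repeating_delimiters(records)
```

Here A has two outgoing edges (to B and to C), which is how the optional state B shows up, and the edge from D back to C captures the repeating C D pair.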

Apr 19, 2023 14

Results
By suitably varying α, a tight superset of possible delimiters is found

A domain expert can then help to identify the true delimiters

Results from 3 different flat-file datasets are as follows

Apr 19, 2023 15

Comparison with naïve approaches

The d_score based approach does a better job than the naïve approaches

The following table shows the improvement

Apr 19, 2023 16

Realistic Situation
The task of identifying the complete list of correct delimiters is difficult
Most likely we will end up with an incomplete list of delimiters

The delimiters which do not appear in every data record (optional delimiters) are the ones most likely to be missed

Apr 19, 2023 17

Identifying Optional Delimiters
Given an incomplete list of delimiters, how can we identify optional delimiters, if any?
  Build an NFA based on the given incomplete information
  Perform clustering to identify possible crucial delimiters
  Perform contrast analysis

Apr 19, 2023 18

Crucial delimiter
A delimiter is considered crucial if missing delimiters will appear immediately following it
The goal is to create two clusters
  One having delimiters which are not crucial
  The other one having crucial delimiters

Apr 19, 2023 19

Identifying crucial delimiters: A few definitions
Succ(X): set of delimiters that can immediately follow X
Dist_App: number of groups of occurrences of X, grouped by the number of text lines between X and the immediately following delimiter
Info_Tuple(n_X^i, f_X^i, t_X^i): information for each Dist_App (n: number of text lines, f: number of occurrences in the group, t: the corresponding text)
Info_Tuple_List L_X: for any X, the list of all possible Info_Tuples

Apr 19, 2023 20

Metric for clustering

\[ r_X^f = \frac{f_X^{max}}{f_X^{total}} \]

r_X^f is likely to be low if an optional delimiter appears immediately after X, and high otherwise
Choose a suitable cut-off value r_c and assign delimiters to different groups as follows:
  If r_X^f < r_c, assign X to the group containing possible crucial delimiters
  Else assign X to the group containing non-crucial delimiters
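The clustering rule can be sketched as follows, reading the metric as the largest group frequency of X divided by its total frequency over all Info_Tuples. The tuple layout `(n, f, t)`, the delimiter names, and the cut-off value 0.9 are assumptions chosen for illustration only.

```python
def frequency_ratio(info_tuples):
    """r_X^f = max_i f_X^i / sum_i f_X^i for one delimiter's Info_Tuple list.
    Each tuple is (n, f, t): lines to next delimiter, frequency, text."""
    freqs = [f for _, f, _ in info_tuples]
    return max(freqs) / sum(freqs)


def split_crucial(info_lists, cutoff):
    """Assign each delimiter to the crucial or non-crucial group."""
    crucial, non_crucial = set(), set()
    for delim, tuples in info_lists.items():
        if frequency_ratio(tuples) < cutoff:
            crucial.add(delim)
        else:
            non_crucial.add(delim)
    return crucial, non_crucial


# Hypothetical Info_Tuple lists for two delimiters X and Y.
info_lists = {
    "X": [(50, 20, "l1.txt"), (20, 12, "l2.txt")],  # occurrences split in two groups
    "Y": [(3, 40, "y.txt")],                        # always the same gap
}

crucial, non_crucial = split_crucial(info_lists, cutoff=0.9)
```

A delimiter whose occurrences all share one gap length gets a ratio of 1 and is pruned, while a delimiter whose occurrences split across several gap lengths gets a lower ratio and is kept for contrast analysis.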

Apr 19, 2023 21

Observations and Facts
Missing optional delimiters can appear immediately after crucial delimiters ONLY
  Non-crucial delimiters can be pruned away
Consider two Info_Tuples (n_X^1, f_X^1, t_X^1) and (n_X^2, f_X^2, t_X^2) in L_X
If a missing delimiter appears immediately after the appearance corresponding to the first tuple but not the second one:
  n_X^1 > n_X^2
  The missing delimiter will appear in t_X^1 but not in t_X^2

Apr 19, 2023 22

A hypothetical example illustrating Contrast Analysis

Suppose X is a crucial delimiter having 2 Info_Tuples, L1 and L2, as follows:

L1 = (50, 20, l1.txt)
L2 = (20, 12, l2.txt)

Sequence mining on l1.txt and l2.txt yields two sets of frequently occurring sequences, S1 and S2, as follows:

S1 = { f1, f5, f6, f8, f13, f21 }
S2 = { f1, f4, f6, f7, f8, f10, f13, f21 }

Since f5 ∈ S1 but f5 ∉ S2, f5 is a possible missing delimiter
f5 is a missing delimiter only if it has a high d_score or is verified by a domain expert as a valid delimiter

Apr 19, 2023 23

Contrast Analysis
For any i, j, if n_X^i > n_X^j, look for frequently occurring sequences in t_X^i and t_X^j; call them fs_X^i and fs_X^j respectively

If there exists a frequent sequence fs such that fs ∈ fs_X^i but fs ∉ fs_X^j, then fs is quite likely to be a possible delimiter

If fs has a fairly high d_score, or is identified by a domain expert as a valid delimiter, add it to the incomplete list as a newly found delimiter
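At its core, contrast analysis is a set difference followed by a confirmation step. The sketch below assumes the frequent sequences have already been mined from the two texts; the d_score values, the threshold of 0.7, and the function names are hypothetical.

```python
def contrast(frequent_long, frequent_short):
    """Sequences frequent in the text with more lines (larger n) but not
    frequent in the text with fewer lines."""
    return set(frequent_long) - set(frequent_short)


def confirm(candidates, d_score, threshold, expert_approved=frozenset()):
    """Keep candidates with a high enough d_score or expert approval."""
    return {
        c for c in candidates
        if d_score.get(c, 0.0) >= threshold or c in expert_approved
    }


# Frequent sequences from the hypothetical example, S1 and S2.
S1 = {"f1", "f5", "f6", "f8", "f13", "f21"}
S2 = {"f1", "f4", "f6", "f7", "f8", "f10", "f13", "f21"}

candidates = contrast(S1, S2)

# Made-up score, for illustration only.
new_delimiters = confirm(candidates, d_score={"f5": 0.8}, threshold=0.7)
```

The difference is taken in one direction only: sequences such as f4, which appear only in the shorter text, are not candidates.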

Apr 19, 2023 24

Generalized Contrast Analysis
In case of more than two Info_Tuples, identify the mean of all n_X^i values:

\[ n_X^{mean} = \frac{\sum_{i=1}^{l} n_X^i f_X^i}{f_X^{total}} \]

Form a group by appending text from all Info_Tuples where n_X^i ≥ n_X^{mean}

Form another group by appending text from all Info_Tuples where n_X^j < n_X^{mean}

Perform contrast analysis among all such possible groups

Apr 19, 2023 25

Another example illustrating Generalized Contrast Analysis

Suppose X is a crucial delimiter having 3 Info_Tuples, L1, L2, L3, as follows:

L1 = (50, 20, l1.txt)
L2 = (20, 12, l2.txt)
L3 = (15, 10, l3.txt)

Mean number of lines:

\[ n_X^{mean} = \frac{(50 \times 20) + (20 \times 12) + (15 \times 10)}{20 + 12 + 10} \approx 33.09 \]

Append l2.txt and l3.txt; call the result t2.txt
Sequence mining on l1.txt and t2.txt yields two sets of frequently occurring sequences, S1 and S2, as follows:

S1 = { f1, f5, f6, f8, f13, f21 }
S2 = { f1, f4, f6, f7, f8, f10, f13, f21 }

Since f5 ∈ S1 but f5 ∉ S2, f5 is a possible missing delimiter
f5 is a missing delimiter only if it has a high d_score or is verified by a domain expert as a valid delimiter
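The grouping step of the generalized analysis can be checked numerically with a short sketch. It assumes the mean is the frequency-weighted mean of the line counts, and that a tuple whose line count equals the mean goes to the upper group; that tie-breaking choice, and the function names, are assumptions.

```python
def weighted_mean_lines(info_tuples):
    """Frequency-weighted mean of n over (n, f, t) Info_Tuples."""
    total = sum(f for _, f, _ in info_tuples)
    return sum(n * f for n, f, _ in info_tuples) / total


def split_by_mean(info_tuples):
    """Split the texts into those at or above the mean and those below it."""
    mean = weighted_mean_lines(info_tuples)
    above = [t for n, _, t in info_tuples if n >= mean]
    below = [t for n, _, t in info_tuples if n < mean]
    return mean, above, below


# The three Info_Tuples from the example.
tuples = [(50, 20, "l1.txt"), (20, 12, "l2.txt"), (15, 10, "l3.txt")]
mean, above, below = split_by_mean(tuples)
```

For these tuples the mean falls between 20 and 50, so l1.txt forms one group and l2.txt and l3.txt are appended to form the other.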

Apr 19, 2023 26

Overall Algorithms

Apr 19, 2023 27

Results: Optional delimiters

% Pruning=

Apr 19, 2023 28

Results: Non-optional Missing delimiters

Even though it was designed for finding optional delimiters, our algorithm also works, in some cases, for missing non-optional delimiters

If a missing non-optional delimiter appears exactly in the same location in each record, then our algorithm fails

If a non-optional delimiter has a backward edge coming from a delimiter that appears later in a topologically sorted NFA then our algorithm works

Apr 19, 2023 29





Conclusion

Apr 19, 2023 30

Hypergraph Mining
Basic Motivation
  To find useful "transitive relations" (hypergraphs) among genes
Example (Gene-Disease Relationship)
  Gene A is related to gene B
  Gene B is related to gene C
  Is gene A related to gene C?
Gene Source
  Microarray experiments
Information Source
  Online literature abstracts

Apr 19, 2023 31

Formal Problem Definition
Given
  A dictionary K_T of keywords
  A dictionary K_M of user-provided keywords (K_T ⊃ K_M)
  A collection of literature abstracts, each abstract represented as a set of keywords
Task
  To find hyperedges exceeding a user-defined threshold, each of which involves a set of keywords from K_M that are potentially connected by another set of linking words from K_T \ K_M

Apr 19, 2023 32

Modeling
Purpose
  To use a similar approach as frequent itemset mining
Define
  total weight = support + cross support
  Support: the set of keywords appears together in one document
  Cross support: the set of keywords can be partitioned so that each partition appears in a different document
Issues
  Since the downclosure property does not hold for total weight, a modified downclosure property can be defined

Apr 19, 2023 33

Idea
Support satisfies the downclosure property
  Let X be a set and Ω be its power set. A function f : Ω → R+ satisfies the downclosure property if for all A, B ∈ Ω, A ⊃ B implies f(B) ≥ f(A)
Cross support can be designed to be restricted below a particular value, i.e., it is bounded
Form a function h as the sum of two functions, h = f + g
  f satisfies the downclosure property
  g is bounded
h satisfies the modified downclosure property
  For any θ ≥ 0, if h(K_n) ≥ θ then f(K_{n-1}) ≥ max{0, θ - sup(g)}
This property can be used to devise an efficient algorithm
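The modified downclosure property gives an Apriori-style necessary condition that can prune candidate keyword sets before their total weight is computed. The sketch below assumes that K_{n-1} stands for any (n-1)-keyword subset of K_n, that support is counted as the number of documents containing every keyword, and that an upper bound on g (cross support) is known in advance. Cross support itself is not implemented, and all names are hypothetical.

```python
from itertools import combinations


def support(keywords, documents):
    """Number of documents (keyword sets) containing every keyword."""
    wanted = set(keywords)
    return sum(1 for doc in documents if wanted <= doc)


def may_reach_threshold(candidate, documents, theta, g_upper_bound):
    """Necessary condition for h(candidate) >= theta: every subset with one
    keyword removed must have support >= max(0, theta - sup(g))."""
    needed = max(0, theta - g_upper_bound)
    return all(
        support(subset, documents) >= needed
        for subset in combinations(candidate, len(candidate) - 1)
    )


# Hypothetical abstracts, each reduced to a set of keywords.
documents = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"d"}]

pruned = not may_reach_threshold(("a", "b", "c"), documents, theta=3, g_upper_bound=1)
kept = may_reach_threshold(("a", "b", "c"), documents, theta=3, g_upper_bound=2)
```

The condition follows from h = f + g: if h(K_n) ≥ θ then f(K_n) ≥ θ - sup(g), and downclosure of f carries that bound over to every subset. A candidate that fails the check can be discarded; one that passes still has to be evaluated.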

Apr 19, 2023 34

Results

Apr 19, 2023 35

Similarity Measure among sets of genes

Each file containing gene names can be considered as a Discrete Random Variable (DRV)

Each such DRV can take several values (gene names)

For two such files X,Y and for any pair (x,y), x∈X and y∈Y, p(x,y) can be computed from online abstracts based on co-occurrence

Now defining Z = g(X,Y), Z is a random variable
The expectation of Z can be used as a similarity measure
Different choices of g give rise to different similarity measures
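As a small illustration of the expectation-based similarity, the sketch below estimates p(x,y) from co-occurrence counts and then computes E[g(X,Y)]. The normalization (co-occurrence counts divided by their sum over all pairs), the sample abstracts, and the particular g are assumptions; the slides leave both the estimator and the choice of g open.

```python
from itertools import product


def joint_distribution(genes_x, genes_y, abstracts):
    """Estimate p(x, y) as the normalized number of abstracts mentioning both."""
    counts = {
        (x, y): sum(1 for a in abstracts if x in a and y in a)
        for x, y in product(genes_x, genes_y)
    }
    total = sum(counts.values())
    if total == 0:
        return {pair: 0.0 for pair in counts}
    return {pair: c / total for pair, c in counts.items()}


def expected_similarity(p, g):
    """E[g(X, Y)] under the joint distribution p."""
    return sum(prob * g(x, y) for (x, y), prob in p.items())


# Hypothetical abstracts, each reduced to the set of gene names it mentions.
abstracts = [{"A", "C"}, {"A", "B"}, {"B", "C"}, {"A", "C"}]

p = joint_distribution(["A"], ["B", "C"], abstracts)

# One arbitrary choice of g: reward pairs that involve gene C.
similarity = expected_similarity(p, lambda x, y: 1.0 if y == "C" else 0.0)
```

Swapping in a different g changes the measure without touching the estimated distribution.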

Apr 19, 2023 36





Conclusion

Apr 19, 2023 37

Query Planning for Deepweb Mining

A huge source of online biological information is available in the form of the deep web

An online query form needs to be filled out

Required information is available by filling out many such forms from different websites

There might be some dependency among these forms

Requires redundancy elimination

Apr 19, 2023 38





Conclusion

Apr 19, 2023 39

Semi-supervised Ranking
Ranking
  Given a training set of examples with labels / pairwise relationships
  The task is to rank an unseen test set, i.e., to get a permutation so that relevant examples are ranked higher than irrelevant ones
  This corresponds to learning a ranking function
Semi-supervised Ranking
  Incorporating unlabeled examples to learn the ranking function
  Out-of-sample extension

Apr 19, 2023 40

Potential Application
Following a microarray experiment, it might be possible to guess whether gene A is more important than gene B involved in the experiment

However, determining all possible order relationships is time consuming and error prone

Thus, from a small set of order relationships, and using other genes from the experiment as unlabeled data, a semi-supervised ranking function can be learned

Apr 19, 2023 41





Conclusion

Apr 19, 2023 42

Multiple Instance Learning
Instead of an instance-label pair (x,y), a bag-label pair (B,y) is provided as training data
A bag contains multiple instances
  A bag label is negative if each instance in the bag has a negative label
  A bag label is positive if there exists at least one instance with a positive label
Given an unseen bag, the task is to predict its label
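The bag-labeling rule can be written down directly. The sketch below only encodes the standard multiple-instance assumption stated on this slide; the instance-level classifier, the expression values, and the threshold are hypothetical stand-ins, since no learning method has been fixed at this stage.

```python
def bag_label(instance_labels):
    """A bag is positive iff at least one of its instances is positive."""
    return any(instance_labels)


def predict_bag(bag, instance_classifier):
    """Predict a bag label from an instance-level classifier."""
    return any(instance_classifier(x) for x in bag)


# Toy instance classifier: an expression value above 2.0 counts as positive.
def is_positive(expression_value):
    return expression_value > 2.0


negative_bag = [0.3, 1.1, 0.9]
positive_bag = [0.3, 2.7, 0.9]

labels = (predict_bag(negative_bag, is_positive),
          predict_bag(positive_bag, is_positive))
```

The asymmetry matters for training: a negative bag labels all of its instances, while a positive bag only guarantees that some unknown instance is positive.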

Apr 19, 2023 43

Potential Application
Following a microarray experiment, it might be possible to form bags of genes with appropriate labels

From different biological labs doing similar experiments, many such bags can be obtained to use as training data

Before designing a new microarray experiment, a gene set can be selected based on multiple instance learning

Apr 19, 2023 44

Summary
Use of data mining / machine learning techniques to extract information from biological data
Work done
  Learning layouts of flat-file biological datasets
  Hypergraph mining
  Similarity measure among sets of genes
Proposed Work
  Study and application of machine learning techniques
