Artificial Intelligence Research Laboratory, Department of Computer Science
RECOMB 2009
Acknowledgements: This work is supported in part by a grant from the National Science Foundation (NSF 0711356) to Vasant Honavar.
Combining Abstraction and Super-structuring on Macromolecular Sequence Classification
Adrian Silvescu, Cornelia Caragea, and Vasant Honavar
Introduction: The choice of features that are used to describe the data presented to a learner, and the level of detail at which they describe the data, can have a major impact on the difficulty of learning, and the accuracy, complexity, and comprehensibility of the learned predictive model. The representation has to be rich enough to capture the distinctions that are relevant from the standpoint of learning, but not so rich as to make the task of learning infeasible.
Results:
[Figure panels: Eukaryotes 3-grams, Prokaryotes 3-grams, Eukaryotes 2-grams, Prokaryotes 2-grams]
Comparison of super-structuring and abstraction (SS+ABS) with super-structuring and feature selection (SS+FSEL), super-structuring only (SS_ONLY), and unigram (UNIGRAM) on the Eukaryotes and Prokaryotes data sets.
[Figure panels: (a) 10 abstractions, (b) 1000 abstractions]
Class distributions induced by one of the m abstractions, and the class distributions induced by three 3-grams sampled from the abstraction on the Eukaryotes 3-gram data set, where (a) m=10; and (b) m=1000. The number of classes is 4.
Problem: Predict the subcellular localization for a protein sequence.
Previous Approaches to Feature Construction:
Super-structuring: generating k-grams
Abstraction: grouping similar features to generate more abstract features
Example:
SINQKLALVIKSGKYTLGYKSTVKSLRQGKSKLIIIAANTPVLRKSELEYYAMLSKTKVYYFQGGNNELGTAVGKLFRVGVVSILEAGDSDILTTLA
3-grams: SIN, INQ, NQK, QKL, KLA, LAL, ALV, LVI, …
Our Approach:
Combining super-structuring and abstraction to construct new features
Constructing abstractions over k-grams
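As a minimal sketch of super-structuring, the snippet below slides a window of length k over a sequence to produce its overlapping k-grams and then counts them. The function name `kgrams`, the chosen prefix, and the bag-of-k-grams counting are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def kgrams(sequence, k):
    """Return the overlapping k-grams of `sequence`, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# First ten residues of the example sequence (prefix length chosen for illustration).
prefix = "SINQKLALVI"
three_grams = kgrams(prefix, 3)
print(three_grams)

# Bag-of-k-grams representation: how often each k-gram occurs in the sequence.
counts = Counter(three_grams)
print(counts.most_common(3))
```

A sequence shorter than k yields an empty list, so no special casing is needed for short inputs.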
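[Figure: abstractions a1, …, a11 constructed over the 3-grams SIN, INQ, NQK, QKL, KLA, …]
Distance between Abstractions:
Let $X$ be a finite set, $p_1$ and $p_2$ two probability distributions over $X$, and $w_1, w_2 \in [0, \infty)$. Let
$$\lambda_1 = \frac{w_1}{w_1 + w_2}, \qquad \lambda_2 = \frac{w_2}{w_1 + w_2}, \qquad \bar{p} = \lambda_1 p_1 + \lambda_2 p_2.$$
The weighted Jensen-Shannon divergence is given by:
$$\mathrm{WJS}([p_1, w_1], [p_2, w_2]) = \lambda_1 \, \mathrm{KL}(p_1 \,\|\, \bar{p}) + \lambda_2 \, \mathrm{KL}(p_2 \,\|\, \bar{p})$$
Then, the distance between two abstractions is defined as follows:
$$\mathrm{dist}(a_i, a_j) = \big(p(a_i) + p(a_j)\big) \cdot \mathrm{WJS}\big([p(Y \mid a_i), \#a_i], [p(Y \mid a_j), \#a_j]\big)$$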
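The following is a hedged sketch of the weighted Jensen-Shannon divergence and the distance between abstractions, together with one possible reading of the poster's greedy agglomerative procedure. Several choices here are assumptions rather than details from the poster: logarithms are base 2, `#a` is taken to be the number of occurrences of abstraction `a`, `p(a)` is `#a` divided by the total count, and the pair merged at each step is the one with the smallest distance. The toy class counts are made up for illustration.

```python
import math
from itertools import combinations


def kl(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def wjs(p1, w1, p2, w2):
    """Weighted Jensen-Shannon divergence of [p1, w1] and [p2, w2] (w1 + w2 > 0)."""
    lam1 = w1 / (w1 + w2)
    lam2 = w2 / (w1 + w2)
    p_bar = [lam1 * a + lam2 * b for a, b in zip(p1, p2)]
    # Skip a zero-weight term so KL is never evaluated against a zero of p_bar.
    return sum(lam * kl(p, p_bar) for lam, p in ((lam1, p1), (lam2, p2)) if lam > 0)


def dist(counts_i, counts_j, total):
    """Distance between two abstractions given their per-class counts."""
    n_i, n_j = sum(counts_i), sum(counts_j)
    p_i = [c / n_i for c in counts_i]  # p(Y | a_i)
    p_j = [c / n_j for c in counts_j]  # p(Y | a_j)
    return (n_i / total + n_j / total) * wjs(p_i, n_i, p_j, n_j)


def agglomerate(kgram_counts, m):
    """Merge the closest pair of abstractions until only m remain."""
    total = sum(sum(c) for c in kgram_counts.values())
    # Initially, each abstraction contains exactly one k-gram.
    abstractions = [({g}, list(c)) for g, c in kgram_counts.items()]
    while len(abstractions) > m:
        i, j = min(
            combinations(range(len(abstractions)), 2),
            key=lambda ij: dist(abstractions[ij[0]][1], abstractions[ij[1]][1], total),
        )
        members = abstractions[i][0] | abstractions[j][0]
        counts = [a + b for a, b in zip(abstractions[i][1], abstractions[j][1])]
        abstractions = [a for n, a in enumerate(abstractions) if n not in (i, j)]
        abstractions.append((members, counts))
    return abstractions


# Hypothetical per-class counts for four 3-grams and two classes.
toy = {"SIN": [8, 2], "INQ": [7, 3], "NQK": [1, 9], "QKL": [2, 8]}
result = agglomerate(toy, 2)
for members, counts in result:
    print(sorted(members), counts)
```

Each merge scans all pairs, so this sketch is quadratic per step and is meant only for small examples.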
In dist(ai, aj), Y is the class variable.
Abstractions over k-grams are constructed by a greedy agglomerative procedure: initially, each abstraction is mapped to a single k-gram; pairs of abstractions are then recursively grouped until m abstractions are obtained, e.g., m=2.
Feature selection:
An alternative approach to reducing the number of k-grams to m k-grams.
We used mutual information between the class variable and the k-grams to rank the k-grams.
Data sets:
Eukaryotes contains 2,427 protein sequences classified into one of four classes.
Prokaryotes contains 997 protein sequences classified into one of three classes.
Conclusions:
We have shown that:
Combining super-structuring and abstraction makes it possible to construct predictive models that use a significantly smaller number of features than those obtained using super-structuring alone.
Abstraction in combination with super-structuring yields better-performing models than those obtained by feature selection in combination with super-structuring.
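For reference, the mutual-information ranking used in the feature-selection baseline can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: each k-gram is treated as a binary feature (present or absent in a sequence), logarithms are base 2, and the function names and toy sequences are hypothetical.

```python
import math
from collections import Counter


def mutual_information(present, labels):
    """I(F; Y) in bits between a binary feature F and the class labels Y."""
    n = len(labels)
    joint = Counter(zip(present, labels))
    f_marg = Counter(present)
    y_marg = Counter(labels)
    # p(f, y) * log2( p(f, y) / (p(f) p(y)) ), written in terms of counts.
    return sum(
        (c / n) * math.log2(c * n / (f_marg[f] * y_marg[y]))
        for (f, y), c in joint.items()
    )


def rank_kgrams(sequences, labels, k, m):
    """Return the m k-grams with the highest mutual information with the class."""
    vocab = sorted({s[i:i + k] for s in sequences for i in range(len(s) - k + 1)})
    scores = {
        g: mutual_information([g in s for s in sequences], labels) for g in vocab
    }
    # sorted() is stable, so ties keep alphabetical order.
    return sorted(vocab, key=lambda g: -scores[g])[:m]


# Toy sequences and class labels, made up for illustration.
sequences = ["SINQ", "SINK", "QKLA", "QKLV"]
labels = ["x", "x", "y", "y"]
print(rank_kgrams(sequences, labels, k=2, m=4))
```

Unlike abstraction, this keeps m of the original k-grams and discards the rest.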