13:06, october 19, 2007
TRANSCRIPT
Chemical Data Mining of the NCI Human Tumor Cell Line Database
H. Wang, J. Klinginsmith, X. Dong, A. C. Lee, R. Guha, Y. Wu, G. M. Crippen, and D. J. Wild
Yoon Soo Pyon
October 19, 2007
Outline
1. Introduction
2. Characterization of the chemical compounds
3. Characterization of the cell line screening growth inhibition values
4. Characterization of the gene expression results
5. Relating dictionary-based structural keys to cellular screening activities
6. Predictive models of activity
7. Relating freely generated SMARTS structures to cellular screening activities
Introduction
• The NCI Developmental Therapeutics Program (DTP) Human Tumor Cell Line dataset is a publicly available database containing cellular assay screening data for over 40,000 compounds tested in sixty human tumor cell lines.
• It also contains microarray gene expression data (961 values) for the cell lines, making it a useful resource that combines chemical, biological, and genomic information.
• The paper takes a formal knowledge discovery approach to characterizing and mining this data set in order to discover relationships between compounds and biological activity values.
NCI 60 Cell Lines data
• 60 cell lines include melanomas, leukemia, and cancers of the breast, prostate, lung, colon, ovary, kidney and central nervous system.
• Screening results include three parameters:
GI50 – 50% growth inhibition
TGI – total growth inhibition
LC50 – 50% lethal concentration
Usually –log(GI50) is used.
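A minimal sketch of the –log(GI50) transform, assuming GI50 is expressed as a molar concentration and the logarithm is base 10 (the slide states neither explicitly). The function name `neg_log_gi50` is hypothetical.

```python
import math

def neg_log_gi50(gi50_molar):
    """Return -log10(GI50); assumes GI50 is a positive molar concentration."""
    if gi50_molar <= 0:
        raise ValueError("GI50 must be a positive concentration")
    return -math.log10(gi50_molar)

# A compound that inhibits growth by 50% at 1 micromolar (1e-6 M).
print(neg_log_gi50(1e-6))
```

Under this assumption, a larger –log(GI50) means the compound reaches 50% growth inhibition at a lower concentration, that is, it is more potent.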
COMPARE algorithm
[Diagram labels: What we have examined; Seed compound; Similarity search; NCI 60 cell line data correlation search; Target compounds]
Characterization of the chemical compounds
• They implemented a local version of the database containing 44,653 compounds, screening results, and gene expression values, using PostgreSQL and gNova CHORD.
• gNova CHORD allows chemical searching and generation of 166-bit structural key fingerprints.
Characterization of the chemical compounds
• Calculation and profiling of predicted property values compared to two other datasets.
• FDA’s Maximum Recommended Therapeutic Dose (MRTD) set: representative of currently marketed drugs
• Randomly selected 40,000-compound subset of PubChem: representative of a diverse set
• Calculated properties (molecular weight, XlogP, polar surface area, number of hydrogen bond donors and acceptors) using OpenEye FILTER
[Figure panels: H-bond donors, H-bond acceptors, molecular weight, XlogP, solubility, polar surface area]
Characterization of the chemical compounds
• Compared the similarity of the drug compounds in the MRTD set with the most similar compounds in the tumor cell line set.
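The nearest-neighbor comparison described above can be sketched as follows. This is an illustration only: it assumes the Tanimoto coefficient on fingerprint bit sets as the similarity measure, which the slide does not specify, and the function names and bit positions are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two sets of 'on' fingerprint bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def nearest_neighbor_similarity(query_bits, reference_fps):
    """Highest Tanimoto similarity between a query and any reference fingerprint."""
    return max(tanimoto(query_bits, ref) for ref in reference_fps)

# Toy fingerprints (made-up bit positions, not real compounds).
mrtd_drug = {3, 17, 42, 105}
tumor_set = [{3, 17, 99}, {3, 17, 42, 105, 127}, {8, 9}]

best = nearest_neighbor_similarity(mrtd_drug, tumor_set)
print(f"max similarity to tumor cell line set: {best:.2f}")
```

Repeating this for every MRTD drug would give a distribution of maximum similarities, which is one way to judge how well the tumor cell line set covers marketed-drug chemistry.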
Characterization of the cell line screening growth inhibition value
• They examined the distribution of –log(GI50) data points across cell lines and compounds.
• Overall, 12.1% of the cell line screening data points are missing.
• Overall, 44.9% of growth inhibition values are equal to 4.0.
• inactive : –log(GI50) < 5
• active : –log(GI50) ≥ 5
• Overall, 19.6% of compounds are considered active.
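The active/inactive rule and the summary percentages above can be sketched as below. The threshold of 5 comes from the slide. The toy values, the use of `None` for missing data points, and the function names are assumptions for illustration.

```python
def classify(neg_log_gi50, threshold=5.0):
    """Label a -log(GI50) value; None marks a missing data point."""
    if neg_log_gi50 is None:
        return "missing"
    return "active" if neg_log_gi50 >= threshold else "inactive"

def summarize(values):
    """Return the fraction of missing, active, and inactive data points."""
    if not values:
        raise ValueError("need at least one data point")
    labels = [classify(v) for v in values]
    return {k: labels.count(k) / len(labels)
            for k in ("missing", "active", "inactive")}

# Toy values for one compound across a few cell lines (made up).
toy = [4.0, 4.0, 5.0, 6.3, None, 4.8, 4.0, 5.2]
summary = summarize(toy)
print(summary)
```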
Characterization of the cell line screening growth inhibition value (Cont’d.)
Characterization of the gene expression results
• Under-expression from the norm: < 0
• Over-expression from the norm: ≥ 0
Relating dictionary-based structural keys to cellular screening activities
• The activity classification (active/inactive) and the structural key fingerprint bits were used to determine which structural features were more prevalent or scarcer in active compounds than in inactive compounds.
• The active-structural ratio:
$$R_{a,j} = \frac{T_{a,j}}{C_a} = \frac{\text{total number of active compounds with feature } j}{\text{number of compounds in the set of active compounds}}$$
• The overall-structural ratio:
$$R_j = \frac{T_j}{C} = \frac{\text{total number of compounds with feature } j}{\text{number of compounds in the complete set}}$$
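As a sketch of the two ratios for a single feature, the active-structural ratio is the fraction of active compounds that have bit j set, and the overall-structural ratio is the same fraction over the complete set. The toy compounds and the helper name `feature_ratio` are hypothetical.

```python
def feature_ratio(fingerprints, feature):
    """Fraction of compounds whose fingerprint has the given bit set."""
    if not fingerprints:
        raise ValueError("need at least one compound")
    return sum(1 for fp in fingerprints if feature in fp) / len(fingerprints)

# Toy data: (set of 'on' bits, is_active); values are made up.
compounds = [
    ({105, 127}, True),
    ({105, 92}, True),
    ({92, 117}, False),
    ({117}, False),
    ({105, 117}, False),
]
all_fps = [fp for fp, _ in compounds]
active_fps = [fp for fp, active in compounds if active]

j = 105
r_active = feature_ratio(active_fps, j)   # active-structural ratio for bit j
r_overall = feature_ratio(all_fps, j)     # overall-structural ratio for bit j
print(j, r_active, r_overall)
```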
Relating dictionary-based structural keys to cellular screening activities (cont’d.)
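• The difference between the active ratio and the overall ratio for feature j:
$$\mathrm{diff}_j = R_{a,j} - R_j$$
• diff_j ≥ 0: the feature appears in a greater percentage of the active compounds than of all compounds.
• diff_j < 0: the feature is lacking in the active compounds compared with all compounds.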
Since nearly all 60 cell lines follow the same trend, the average difference between the active ratio and the overall ratio can be used to find the most important substructures in determining “global” activity and inactivity.
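The averaging step can be sketched as follows: compute the difference between the active ratio and the overall ratio for each bit in each cell line, average across cell lines, and rank the bits. This is a toy illustration with made-up fingerprints and activity flags, and the function names are hypothetical. A cell line with no active compounds is given an active ratio of 0 here, which is a simplification.

```python
def feature_diffs(fingerprints, active_flags, n_bits):
    """Active ratio minus overall ratio for each bit j, in one cell line."""
    active = [fp for fp, a in zip(fingerprints, active_flags) if a]
    diffs = []
    for j in range(n_bits):
        r_overall = sum(j in fp for fp in fingerprints) / len(fingerprints)
        r_active = sum(j in fp for fp in active) / len(active) if active else 0.0
        diffs.append(r_active - r_overall)
    return diffs

def mean_diffs(fingerprints, flags_by_cell_line, n_bits):
    """Average each bit's difference over all cell lines."""
    per_line = [feature_diffs(fingerprints, flags, n_bits)
                for flags in flags_by_cell_line]
    return [sum(col) / len(col) for col in zip(*per_line)]

# Four toy compounds with 3-bit fingerprints, screened in two cell lines.
fps = [{0, 1}, {0}, {2}, {1, 2}]
flags_by_cell_line = [
    [True, True, False, False],
    [True, False, False, False],
]
avg = mean_diffs(fps, flags_by_cell_line, n_bits=3)
ranked = sorted(range(3), key=lambda j: avg[j], reverse=True)
print(avg, ranked)
```

Bits at the top of the ranking are candidates for "global" activity, and bits at the bottom are candidates for "global" inactivity.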
Relating dictionary-based structural keys to cellular screening activities (cont’d.)
• Features associated with global activity indicate a tendency to bind to anything.
• Features associated with global inactivity indicate a tendency to stop binding to tumor-growth-related properties.
• Bits 105, 127, 145, 152, and 99 are the most important for activity.
• Bits 117, 110, 92, 77, and 95 are the most important for inactivity.
Relating dictionary-based structural keys to cellular screening activities (cont’d.)
Relating dictionary-based structural keys to cellular screening activities (cont’d.)
Predictive Models of Activity
• They built machine learning models using WEKA to predict activity individually in each of the 60 cell lines.
• They applied the AD-Tree and Ridor methods, which worked best, to various feature sets using cell line 60 (UO-31).
Predictive Models of Activity
• Not all 166 features are useful in determining cell line activity.
• Thus, feature selection helps increase the prediction accuracy.
Predictive Models of Activity (cont’d.)
• http://www.chembiogrid.org/cheminfo/ncidtp/dtp
Relating freely generated SMARTS structure to cellular screening activities
• Previously, experiments used a constrained dictionary of 166 SMARTS fragments.
• The approach was modified to generate a larger number of SMARTS-based keys (University of Michigan’s method).
• SMARTS strings are lengthened and scored in order to establish strings up to seven atoms long that have a strong tendency to identify active or inactive compounds.
Relating freely generated SMARTS structure to cellular screening activities
• Scoring:
$$\text{ratio} = \frac{\text{number of active compounds identified by a SMARTS string}}{\text{number of inactive compounds identified by the same SMARTS string}}$$
• The ratio of active to inactive compounds in the NCI/DTP dataset is 7274 to 35664, i.e., roughly one active compound for every five inactive compounds.
• Thus, the ratio of significance is 1:5, or 0.2.
• Considering a tenfold improvement over this baseline:
• active compounds: ratio > 2.0
• inactive compounds: ratio < 0.02
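The scoring rule can be sketched as below. The thresholds 2.0 and 0.02 and the counts 7274 and 35664 come from the slide. The function names, the label strings, and the handling of a SMARTS string that matches no inactive compounds are assumptions for illustration.

```python
def smarts_ratio(n_active, n_inactive):
    """Active-to-inactive ratio among compounds matched by one SMARTS string."""
    if n_inactive == 0:
        # Assumption: no inactive matches means an unbounded ratio,
        # or no score at all if nothing matched.
        return float("inf") if n_active > 0 else None
    return n_active / n_inactive

def label_smarts(n_active, n_inactive, active_cut=2.0, inactive_cut=0.02):
    """Classify a SMARTS string by the tenfold-improvement thresholds."""
    ratio = smarts_ratio(n_active, n_inactive)
    if ratio is None:
        return "unmatched"
    if ratio > active_cut:
        return "active-associated"
    if ratio < inactive_cut:
        return "inactive-associated"
    return "not significant"

# Dataset-wide baseline from the slide: about 0.2.
baseline = 7274 / 35664
print(round(baseline, 3))

# Hypothetical match counts for three SMARTS strings.
print(label_smarts(30, 10), label_smarts(1, 100), label_smarts(20, 100))
```

A string that matches actives and inactives at about the baseline rate, such as the third example, carries no signal under this rule.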
Relating freely generated SMARTS structure to cellular screening activities
Thank You