![Page 1: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/1.jpg)
Fingerprinting Chemical Structures
Rajarshi Guha
https://github.com/rajarshi/ctpa-fingerprints
September 9, 2014
![Page 2: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/2.jpg)
High Throughput Screening
• Test thousands to hundreds of thousands of compounds in one or more assays
  – Biochemical, genetic, pharmacological assays
• Employs a robotic platform
• Rapidly identify novel modulators of biological systems
  – Infectious agents
  – Cellular basis of diseases
![Page 3: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/3.jpg)
Goal of HTS
• Rapidly screen large compound collections
• Efficiently identify real actives
  – Test them in slower, accurate, expensive screens
• Use the data to learn what types of compounds tend to be active
• Use the model to suggest more compounds to screen
[Figure: number of molecules at each screening stage (stage labels: HTS, Cherry Picks; values shown: 300K, 1000, 300)]
![Page 4: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/4.jpg)
HTS Data Types
• Categorical
  – active/inactive or toxic/nontoxic
• Continuous
  – Single point
  – Dose response
• Multiple readouts
  – Might read at different wavelengths or timepoints
  – More complex when dealing with imaging
• These (usually) represent the dependent variable
[Figure: two response plots, Response vs. log10 Concentration and Response vs. Concentration]
![Page 5: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/5.jpg)
Independent Variable(s)
• HTS tests the activity of a molecule
  – the molecule is our “independent variable”
• Need to describe the molecular structure
  – Various discrete or real-valued descriptors
  – Surfaces (3D)
  – Binary fingerprints
Activity = f(Structure)
![Page 6: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/6.jpg)
Fingerprint Representation
• Lots of types of fingerprints
• “Keyed” fingerprints indicate the presence or absence of a structural feature
• Length can vary from 166 to 4096 bits or more
• Fingerprints are usually compared using the Tanimoto metric
1 0 1 1 0 0 0 1 0
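As a minimal sketch of the Tanimoto metric mentioned above, the following base R function treats a fingerprint as a plain 0/1 vector and divides the number of bits set in both vectors by the number set in at least one. The function name `tanimoto` and the second fingerprint `fp2` are made up for illustration; this is not the fingerprint package's implementation.

```r
# Tanimoto similarity between two equal-length 0/1 vectors (base R only).
# 'tanimoto' is a hypothetical helper, not a function from the fingerprint package.
tanimoto <- function(a, b) {
  stopifnot(length(a) == length(b))
  both   <- sum(a == 1 & b == 1)     # bits set in both fingerprints
  either <- sum(a == 1 | b == 1)     # bits set in at least one fingerprint
  if (either == 0) return(NA_real_)  # undefined when neither has any bit set
  both / either
}

fp1 <- c(1, 0, 1, 1, 0, 0, 0, 1, 0)  # the bit string shown on the slide
fp2 <- c(1, 0, 0, 1, 0, 1, 0, 1, 0)  # a second, made-up fingerprint

tanimoto(fp1, fp2)  # 3 shared bits / 5 bits set in either = 0.6
```

Identical fingerprints give 1 and fingerprints with no bits in common give 0, which is why the metric is convenient for ranking.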
![Page 7: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/7.jpg)
What Can I Use Them For?
• Search
  – Given a potent active molecule, find similar ones (or dissimilar, but also potent)
• Prediction
  – Given a set of active & inactive molecules, build a model to predict which members of a large collection will be active
• Clustering
  – Given a set of molecules, do they cluster into structurally different groups?
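The search use case can be sketched in a few lines of base R: score every fingerprint in a collection against a query by Tanimoto similarity and keep the top hits. The data here are random bits and the names `collection`, `query`, and `top` are illustrative assumptions, so this shows only the shape of the computation.

```r
# Similarity search sketch on made-up data (base R only).
set.seed(11)
nbit <- 64
collection <- matrix(rbinom(200 * nbit, 1, 0.3), nrow = 200)  # 200 random fingerprints
query <- collection[1, ]                                       # pretend row 1 is the known active

# Tanimoto similarity of the query against every row at once
common <- as.vector(collection %*% query)            # bits shared with the query
sims <- common / (rowSums(collection) + sum(query) - common)

# Indices of the five most similar fingerprints (the query itself ranks first)
top <- order(sims, decreasing = TRUE)[1:5]
data.frame(index = top, similarity = sims[top])
```

Reversing the sort order would give the most dissimilar entries instead, which is the starting point for the "dissimilar, but also potent" variant.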
![Page 8: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/8.jpg)
Fingerprints in R
• The fingerprint package supports I/O, manipulation, similarity methods, and various utility methods
• A fingerprint is an S4 object
  – Create them manually
    new("fingerprint", nbit = 1024, bits = c(1,4,5,100,200))
  – Read them in from files
    fp.read('data/cdk.fp', size=1024, lf=cdk.lf)
![Page 9: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/9.jpg)
Getting Fingerprints
• You can also generate fingerprints from chemical structures using the rcdk package
• If you’re not doing cheminformatics you can read in your own FP data by implementing a line reader
  – See cdk.lf, moe.lf, bci.lf
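A line reader is just a function that turns one line of text into a fingerprint description. The sketch below assumes a hypothetical file format of `<name> <bit> <bit> ...` with whitespace-separated on-bit positions, and returns a three-element list (name, on-bit positions, extra data), mirroring the custom reader used for the solubility data later in this deck. The name `my.lf` is made up.

```r
# Hypothetical line reader for lines such as "doc42 3 17 101".
# Assumed format (not from the original slides): a name followed by on-bit positions.
my.lf <- function(line) {
  toks <- strsplit(trimws(line), "[[:space:]]+")[[1]]
  name <- toks[1]               # first token is the identifier
  bits <- as.numeric(toks[-1])  # remaining tokens are the positions of set bits
  list(name, bits, list())      # third element holds any extra data (none here)
}

parsed <- my.lf("doc42 3 17 101")
str(parsed)
```

A function of this form would then be supplied as the `lf` argument of `fp.read`, in the same way `cdk.lf` is passed in the examples above.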
![Page 10: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/10.jpg)
Random Fingerprints
• Useful for benchmarking, generating null distributions, exploring effects of bit density

    library(fingerprint)

    ## How long does a similarity matrix calculation take as a function of fp length?
    nfp <- 300
    sizes <- c(64, 128, 512, 1024, 4096, 8192)
    times.size <- sapply(sizes, function(size) {
      fps <- lapply(1:nfp, function(i) random.fingerprint(size, size * 0.35))
      system.time(junk <- fp.sim.matrix(fps))[3]   # elapsed time in seconds
    })

    ## For a given length, how does bit density affect calculation time?
    ## (stored separately so the first set of timings is not overwritten)
    densities <- c(0.1, 0.25, 0.5, 0.75, 0.95)
    times.density <- sapply(densities, function(density) {
      fps <- lapply(1:nfp, function(i) random.fingerprint(1024, 1024 * density))
      system.time(junk <- fp.sim.matrix(fps))[3]
    })
![Page 11: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/11.jpg)
Random Fingerprints
[Figure: two timing plots for the similarity matrix calculation, Time (s) vs. Fingerprint Length and Time (s) vs. Bit Density]
![Page 12: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/12.jpg)
Compare Similarity Metrics
[Figure: density of pairwise similarity values (x-axis: Similarity, y-axis: density), one curve per metric: Dice and Tanimoto]
• More than 20 similarity metrics
  – Some are written in C, so they are very fast and applicable to larger fingerprint collections
  – Others are in pure R and are slow
    fps <- fp.read('data/cdk.fp', size=881, lf=cdk.lf, header=TRUE)[1:500]
    s.tanimoto <- fp.sim.matrix(fps, method='tanimoto')
    s.dice <- fp.sim.matrix(fps, method='dice')
    d <- rbind(data.frame(method='Tanimoto', s=as.numeric(s.tanimoto)),
               data.frame(method='Dice', s=as.numeric(s.dice)))
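To see why the Dice and Tanimoto distributions differ in a systematic way, it can help to compute both from scratch. The sketch below uses the textbook definitions on random bit matrices in base R; the fingerprint package's own implementations may differ in detail, and all variable names are illustrative. Under these definitions Dice is a monotone function of Tanimoto, D = 2T / (1 + T), so Dice values are never smaller.

```r
# Textbook Tanimoto and Dice similarity matrices on made-up data (base R only).
set.seed(1)
n <- 50
nbit <- 128
m <- matrix(rbinom(n * nbit, 1, 0.35), nrow = n)  # one random fingerprint per row

common <- m %*% t(m)           # common[i, j] = bits set in both i and j
on <- rowSums(m)               # number of bits set in each fingerprint
total <- outer(on, on, "+")    # on[i] + on[j]

s.tanimoto <- common / (total - common)
s.dice <- 2 * common / total

# Largest deviation from the identity D = 2T / (1 + T); should be ~0
max(abs(s.dice - 2 * s.tanimoto / (1 + s.tanimoto)))
```

Because the mapping is monotone, the two metrics rank neighbours of a query identically; they differ in the spread of the values, which matters when a fixed similarity cutoff is used.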
![Page 13: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/13.jpg)
Predicting with Fingerprints
• Read in fingerprints & convert to matrix form
• See
  – data/solubility.csv
  – data/solubility.maccs
• 33,182 observations of solubility
• 57,857 fingerprints
• Requires some data wrangling before modeling
    OOB estimate of error rate: 22.37%
    Confusion matrix:
           high  low medium class.error
    high    181   52    621  0.78805621
    low      35 5611   4598  0.45226474
    medium   89 2029  19965  0.09591088
[Figure: bar chart of Frequency by Solubility Class (high, low, medium)]
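The error rates in the random forest output above follow directly from the confusion matrix, and recomputing them makes the class imbalance easier to see. This base R sketch re-enters the counts from the slide by hand (rows are observed classes, columns are predicted classes); the variable names are illustrative.

```r
# Recompute per-class and overall error from the confusion matrix on the slide.
cm <- matrix(c(181,   52,   621,
                35, 5611,  4598,
                89, 2029, 19965),
             nrow = 3, byrow = TRUE,
             dimnames = list(observed  = c("high", "low", "medium"),
                             predicted = c("high", "low", "medium")))

class.error <- 1 - diag(cm) / rowSums(cm)   # fraction of each observed class misclassified
oob.error <- 1 - sum(diag(cm)) / sum(cm)    # overall misclassification rate

round(class.error, 4)
round(100 * oob.error, 2)
```

The overall rate is dominated by the large medium class, so it hides the fact that most high-solubility compounds are misclassified.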
![Page 14: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/14.jpg)
Predicting with Fingerprints
• The model will use MACCS keys
  – 166 bits
  – Each bit is associated with a structural feature
• Low resolution, somewhat simplistic
• Data comes in a non-standard format, so we must implement our own line reader
• Classification problem
  – predict low/medium/high solubility
![Page 15: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/15.jpg)
Predicting with Fingerprints

    library(fingerprint)
    library(randomForest)

    sol <- read.csv('data/solubility.csv', header=TRUE)

    ## Custom line reader: first token is the title, the rest are the set bits
    fps <- fp.read('data/solubility.maccs', header=FALSE, size=166,
                   lf=function(line) {
                     toks <- strsplit(line, " ")[[1]]
                     title <- toks[1]
                     bits <- as.numeric(toks[2:length(toks)])
                     list(title, bits, list())
                   })

    ## Extract fingerprints for which we have a label
    common <- which( sapply(fps, function(x) x@name) %in% sol$sid )
    fps <- fps[common]

    ## Order the fingerprints & data so rows line up by sid
    sol <- sol[order(sol$sid),]
    fps <- fps[order(sapply(fps, function(x) as.integer(x@name)))]

    ## Make X matrix
    fpm <- fp.to.matrix(fps)

    ## Model!
    m1 <- randomForest(x=fpm, y=as.factor(sol$label))
![Page 16: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/16.jpg)
Predicting with Fingerprints
• We can then use the RF variable importance measure
• Features important for predictive performance
  – Presence of aromatic rings
  – Presence of charged atoms
  – Presence of 6-membered rings
  – N & O atoms connected in a chain
• Chemically sensible
[Figure: random forest variable importance plot for the MACCS bits (x-axis: MeanDecreaseGini)]
https://github.com/cdk/cdk/blob/master/descriptor/fingerprint/src/main/resources/org/openscience/cdk/fingerprint/data/maccs.txt
![Page 17: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/17.jpg)
Clustering with Fingerprints
• Generate a distance matrix directly from a list of fingerprints
    fps <- fp.read('data/cdk.fp', size=881, lf=cdk.lf)[1:500]
    sims <- fp.sim.matrix(fps)    # pairwise similarity matrix
    dmat <- as.dist(1-sims)       # convert similarity to distance
    clus <- hclust(dmat)
    par(mar=c(1,4,1,1))
    plot(clus, labels=FALSE, xlab='', main='')
[Figure: dendrogram from the hierarchical clustering (y-axis: Height)]
• Exercise: How do clusters vary with similarity metric and/or fingerprint type?
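One possible starting point for the exercise, sketched in base R on random bits rather than real fingerprints: cluster the same set under two similarity metrics and cross-tabulate the resulting groups. The textbook Tanimoto and Dice definitions are used here, average linkage is an assumption (complete and single linkage depend only on the rank order of distances, so they would give the same tree for these two metrics), and cutting into four clusters is arbitrary.

```r
# Cluster the same made-up fingerprints under two metrics and compare groups.
set.seed(42)
n <- 60
m <- matrix(rbinom(n * 166, 1, 0.3), nrow = n)  # 60 random 166-bit fingerprints

common <- m %*% t(m)
total <- outer(rowSums(m), rowSums(m), "+")
sim.tanimoto <- common / (total - common)
sim.dice <- 2 * common / total

# Average linkage is sensitive to the actual distance values, not just their order
clus.t <- hclust(as.dist(1 - sim.tanimoto), method = "average")
clus.d <- hclust(as.dist(1 - sim.dice), method = "average")

groups.t <- cutree(clus.t, k = 4)
groups.d <- cutree(clus.d, k = 4)
agreement <- table(tanimoto = groups.t, dice = groups.d)
agreement
```

A cross-tabulation that is close to a permuted diagonal indicates that the two metrics produce essentially the same partition.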
![Page 18: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/18.jpg)
Comparing Data Sets
• How do we compare two sets of chemical structures?
  – Sizes may be different, and very large
• Pairwise?
  – O(N^2) running time
  – Need to aggregate the resultant pairwise values
![Page 19: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/19.jpg)
Comparing Data Sets
• How do we compare two sets of chemical structures?
  – Sizes may be different, and very large
• Distributions?
  – Of what?
  – Can lead to multiple ways to generate a comparison
  – Data fusion?
![Page 20: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/20.jpg)
Bit Spectrum
[Figure: bit spectrum of a dataset, Normalized Frequency vs. Bit Position]
• Vector summary of the fingerprints for a dataset
• Defined as the fraction of times a bit position is set to 1, for each bit position
    0   0   1    ...
    0   1   0    ...
    1   1   1    ...
    1   0   1    ...
    ...               (~ 10K molecules)

    0.5 0.5 0.75 ...  (bit spectrum)
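The small example above can be reproduced in base R: with fingerprints stored as rows of a 0/1 matrix, the bit spectrum is simply the column mean. This is a sketch of the definition, not the package's `bit.spectrum` function, and the variable names are illustrative.

```r
# Bit spectrum of the four example fingerprints (first three bit positions only).
fpm <- matrix(c(0, 0, 1,
                0, 1, 0,
                1, 1, 1,
                1, 0, 1),
              nrow = 4, byrow = TRUE)

bit.spec <- colMeans(fpm)  # fraction of fingerprints with each bit set
bit.spec                   # 0.50 0.50 0.75
```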
![Page 21: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/21.jpg)
Bit Spectrum
• Once the bit spectra are computed, comparison of two datasets is an O(1) operation, independent of dataset size
  – Simply take the difference of the two bit spectra
• e.g.: Compare ~ 800 solubles with > 30k insolubles

    library(ggplot2)

    ## make two subsets and generate bit spectra
    sol.idx <- which(sol$label == 'high')
    insol.idx <- which(sol$label != 'high')
    sol.bs <- bit.spectrum(fps[sol.idx])
    insol.bs <- bit.spectrum(fps[insol.idx])

    ## display a difference plot
    bsdiff <- sol.bs - insol.bs
    d <- data.frame(x=1:length(sol.bs), y=bsdiff)
    ggplot(d, aes(x=x,y=y))+geom_line()+
      xlab('Bit Position')+
      ylab('Normalized Frequency')+
      ylim(c(-1,1))
[Figure: difference plot of the two bit spectra, Δ Normalized Frequency vs. Bit Position]
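The reason the comparison does not depend on dataset size is that each set is first reduced to one vector of length equal to the number of bits. The base R sketch below builds two simulated sets of very different sizes, in which the second set is deliberately enriched in the first ten bits, and recovers those bits from the spectrum difference. All data, names, and the 0.2 threshold are assumptions for illustration.

```r
# Compare two simulated datasets of different sizes via their bit spectra.
set.seed(7)
nbit <- 166
p.a <- runif(nbit, 0.05, 0.5)   # per-bit frequencies for set A
p.b <- p.a
p.b[1:10] <- 0.9                # set B is enriched in the first ten bits

# Matrices are filled column-wise, so repeating each probability n times
# gives every column (bit position) its own frequency.
make.set <- function(n, p) {
  matrix(rbinom(n * length(p), 1, rep(p, each = n)), nrow = n)
}
set.a <- make.set(800, p.a)     # small set
set.b <- make.set(3000, p.b)    # much larger set

spec.a <- colMeans(set.a)
spec.b <- colMeans(set.b)
bsdiff <- spec.b - spec.a       # one subtraction, regardless of set sizes

flagged <- which(abs(bsdiff) > 0.2)  # bits whose frequency differs markedly
flagged
```

In practice the flagged positions would be mapped back to structural features, for example using the MACCS key definitions.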
![Page 22: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/22.jpg)
Explaining Poor Model Performance
• Training set for model
• Poor predictions on test set
• Both test set classes look like the toxic class in the training set
Guha & Schurer, J. Comp. Aided. Molec. Des., 2008, 22, 367
![Page 23: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/23.jpg)
Summary
• Fingerprints are a useful representation for molecules
  – fast, objective, compact
• But they are also applicable to other domains and objects
  – Can be generated from arbitrary datasets (e.g. text) or objects (e.g. networks)
• Useful for various tasks
  – search & comparison, prediction, clustering
• The fingerprint package provides a domain-agnostic way to handle binary fingerprints
![Page 24: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/24.jpg)
![Page 25: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/25.jpg)
Comparing Clusterings
• Generate multiple representations of a set of molecules
• How differently do these representations cluster?
  – Measure correlation of clusters using the cophenetic coefficient
• A variety of R packages to support this
  – dendextend, clValid
![Page 26: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/26.jpg)
Comparing Clusterings
[Figure: side-by-side dendrograms of the same molecules clustered with Pubchem (881-bit) and CDK Extended (1024-bit) fingerprints; height axes run from 0.0 to 0.8]
![Page 27: Fingerprinting Chemical Structures](https://reader033.vdocuments.mx/reader033/viewer/2022052622/559366fb1a28ab9f2d8b462a/html5/thumbnails/27.jpg)
Comparing Clusterings
Pairwise cophenetic correlations for clusterings generated using different fingerprints

                  Pubchem    CDK Extended  CDK Graph  MACCS
    Pubchem       1.0000000  0.7075479     0.6879805  0.5752923
    CDK Extended  0.7075479  1.0000000     0.8050349  0.7386863
    CDK Graph     0.6879805  0.8050349     1.0000000  0.7288428
    MACCS         0.5752923  0.7386863     0.7288428  1.0000000
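A cophenetic correlation like those in the table above can be computed with base R alone, using `cophenetic()` from the stats package and an ordinary Pearson correlation. The sketch below uses two simulated representations of the same 40 objects, where the second is a noisy copy of the first; the data, the helper `tanimoto.dist`, and the 10% noise level are assumptions, and the slides may have used dendextend instead.

```r
# Cophenetic correlation between two clusterings of the same objects (base R only).
set.seed(3)
n <- 40
rep.a <- matrix(rbinom(n * 128, 1, 0.3), nrow = n)   # representation A
flip <- matrix(rbinom(n * 128, 1, 0.1), nrow = n)    # flip about 10% of the bits
rep.b <- abs(rep.a - flip)                           # representation B: noisy copy of A

# Hypothetical helper: Tanimoto distance matrix from a 0/1 matrix (rows = objects)
tanimoto.dist <- function(m) {
  common <- m %*% t(m)
  total <- outer(rowSums(m), rowSums(m), "+")
  as.dist(1 - common / (total - common))
}

clus.a <- hclust(tanimoto.dist(rep.a))
clus.b <- hclust(tanimoto.dist(rep.b))

# Correlate the dendrogram-implied distances of the two trees
coph <- cor(cophenetic(clus.a), cophenetic(clus.b))
coph
```

Repeating this for every pair of fingerprint types and collecting the values in a symmetric matrix gives a table of the form shown above.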