Data Mining with Neural Networks


  • Data Mining with Neural Networks (outline) - Standard data mining terminology - Preprocessing data - Running neural networks via Analyze/StripMiner - Cherkassky's nonlinear regression problem - Magnetocardiogram data - CBA (chemical and biological agents) data - Drug design with neural networks - The paradox of learning - Principal Component Analysis (PCA) - The kernel transformation and SVMs (Support Vector Machines) - Structural and empirical risk minimization (Vapnik's theory of statistical learning)

  • Standard Data Mining Terminology - Basic terminology: MetaNeural format; descriptors, features, response (or activity) and ID; classification versus regression; modeling/feature detection; training/validation/calibration; vertical and horizontal views of the data - Outliers, rare events and minority classes - Data preparation: data cleansing and scaling - Leave-one-out and leave-several-out validation - Confusion matrix and ROC curves

  • Installing the Basic Version of Analyze - Put analyze, gnuplot, wgnuplot.hlp and wgnuplot.mnu in the working folder - The gnuplot scripts for plotting are: analyze resultss.ttt 3305 for a scatterplot; analyze resultss.ttt 3313 for an error plot; analyze resultss.ttt 3362 for binary classification - Fancier graphics are in the *.jar files (these need the Java runtime environment) - For basic help, try: analyze > readme.txt; analyze help 998; analyze help 997; analyze help 008 - For beginners (unless the Java runtime environment is installed), I recommend displaying results via the gnuplot operators 3305, 3313 and 3362 - To get familiar with Analyze, study the script files in this handout - Don't forget to scale the data

  • Running neural networks in Analyze/StripMiner - Prepare a.pat and a.tes files for training and testing (or whatever you want to name them) - Make sure the data are in MetaNeural format and properly scaled (scaling: analyze a.txt 8) (splitting: analyze a.txt.txt 20; seed 0 keeps the order) (copy cmatrix.txt a.pat and copy dmatrix.txt a.tes) - Run the neural network: analyze a.pat 4331 - Copy a meta, edit meta and run again to override the parameter settings - Results are in resultss.xxx and resultss.ttt for training and testing respectively - Either descale (option 4) and inspect resultss.xxx and resultss.ttt (analyze resultss.xxx 4; analyze resultss.ttt 4), or visualize via analyze resultss.ttt 3305 (and 3313, and 3362)

  • A Vertical and a Horizontal View of the Data Matrix - Vertical view: feature space - Horizontal view: data space

  • Preprocessing: Basic scaling for neural networks - Mahalanobis scale the descriptors - [0-1] scale the response - Use operator 8 in Analyze: e.g., typing analyze a.pat 8 will give scaled results in a.pat.txt - Note: another handy operator is the splitting operator (20); e.g., typing analyze a.pat.txt 20 will split the file into cmatrix.txt and dmatrix.txt, using 0 as the random number seed and putting the first #data rows in cmatrix.txt; using a different seed scrambles the data (a numpy sketch of this scaling follows below)
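
    A minimal numpy sketch of this preprocessing step ("Mahalanobis scaling" is taken here to mean per-descriptor standardization; the function name and arguments are illustrative, not part of Analyze):

    import numpy as np

    def scale_metaneural(X_train, y_train, X_test, y_test):
        """Standardize each descriptor column and map the response to [0-1].

        Scaling constants come from the training data only, so the test set
        is scaled consistently with the training set.
        """
        mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
        sd[sd == 0.0] = 1.0                      # guard against constant descriptors
        Xs_train = (X_train - mu) / sd
        Xs_test = (X_test - mu) / sd

        lo, hi = y_train.min(), y_train.max()
        ys_train = (y_train - lo) / (hi - lo)
        ys_test = (y_test - lo) / (hi - lo)      # test responses may fall slightly outside [0-1]
        return Xs_train, ys_train, Xs_test, ys_test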

  • Cherkassky's Nonlinear Benchmark Data - Generate 500 data points (400 training; 100 testing) - Impossible data for linear models (the slide compares K-PLS and PLS fits) - Note: eta = 0.01; train to 0.02 error (a Python sketch of this workflow appears after the script below)

    cherkasm

    REM cherkasm

    REM GENERATE DATA (2 500 2)

    analyze a.pat 3301

    REM SCALE DATA

    analyze cherkas.pat 8

    REM SPLIT DATA IN TRAINING AND TEST SET (400 2)

    analyze cherkas.pat.txt 20

    copy cmatrix.txt a.pat

    copy dmatrix.txt a.tes

    REM RUN METANEURAL VIA ANALYZE

    analyze a.pat 4331

REM DESCALE RESULTS

    analyze resultss.ttt -4

    REM VISUALIZE RESULTS FOR TEST SET

    analyze resultss.ttt -3305

    pause

    analyze resultss.ttt -3313

    pause

    gnuplot error1.plt

    pause

    REM VISUALIZE RESULTS FOR TRAINING SET

    analyze resultss.xxx -3305

    pause

    analyze resultss.xxx -3313

    pause
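
    For readers without Analyze, a rough Python equivalent of the workflow above. The target function below is a stand-in nonlinear benchmark (the exact data produced by operator 3301 is not reproduced here), and scikit-learn's MLPRegressor stands in for the MetaNeural network:

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)

    # Generate 500 two-input data points from a stand-in nonlinear function
    X = rng.uniform(-1.0, 1.0, size=(500, 2))
    y = np.sin(2 * np.pi * X[:, 0]) * np.exp(-X[:, 1] ** 2) + 0.05 * rng.normal(size=500)

    # Scale descriptors and response (see the preprocessing slide)
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    y = (y - y.min()) / (y.max() - y.min())

    # Split into 400 training and 100 test points
    X_tr, X_te, y_tr, y_te = X[:400], X[400:], y[:400], y[400:]

    # Small feed-forward network with learning rate eta = 0.01
    net = MLPRegressor(hidden_layer_sizes=(10,), learning_rate_init=0.01,
                       max_iter=5000, random_state=0)
    net.fit(X_tr, y_tr)

    print("train MSE:", mean_squared_error(y_tr, net.predict(X_tr)))
    print("test  MSE:", mean_squared_error(y_te, net.predict(X_te)))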

  • Iris Data - For homework: copy a meta, edit meta for different experiments, then summarize and report on the experiments (a scikit-learn sketch appears after the script below)

    irism

    REM IRISM.BAT (3 classes)

    REM GENERATE DATA (5)

    analyze iris 3301

    REM STRIP HEADER

    analyze iris.txt 100

    REM SCALE DATA

    analyze iris.txt.txt 8

    copy iris.txt.txt.txt a.txt

    REM SPLIT DATA (100 2)

    analyze a.txt 20

    copy cmatrix.txt a.pat

    copy dmatrix.txt a.tes

    REM METANEURAL

    REM do copy a meta afterwards to customize

    analyze a.pat 4331

    pause

    REM SCATTERPLOT FOR TEST DATA

    analyze resultss.ttt -3305

    pause

    REM ERRORPLOT FOR TEST DATA

    analyze resultss.ttt -3313

    pause

    REM SCATTERPLOT FOR TRAINING DATA

    analyze resultss.xxx -3305

    pause

    REM ERRORPLOT FOR TRAINING DATA

    analyze resultss.xxx -3313

    pause
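
    An illustrative scikit-learn version of the same Iris experiment, ending with the confusion matrix mentioned under the standard terminology. The 100-point training split mirrors the script; the network size and random seeds are arbitrary choices, not MetaNeural's defaults:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import confusion_matrix

    X, y = load_iris(return_X_y=True)            # 150 samples, 4 descriptors, 3 classes

    # Scale descriptors, then split into 100 training and 50 test points
    X = StandardScaler().fit_transform(X)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=100,
                                              random_state=0, stratify=y)

    clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
    clf.fit(X_tr, y_tr)

    # Rows: true class, columns: predicted class
    print(confusion_matrix(y_te, clf.predict(X_te)))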

  • Classical Regression Analysis with the Pseudo-Inverse (a pseudo-inverse sketch follows below)

    [Embedded spreadsheet (Sheet1): 19 amino acids (Ala, Asn, Asp, Cys, Gln, Glu, Gly, His, Ile, Leu, Lys, Met, Phe, Pro, Ser, Thr, Trp, Tyr, Val) with columns NAME, PIE, PIF, DGR, SAC, MR, Lam, Vol, DDGTS (response) and ID]
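
    A short sketch of classical regression via the Moore-Penrose pseudo-inverse, as it would be applied to descriptor/response data like the amino acid set above (function names are illustrative):

    import numpy as np

    def fit_pseudo_inverse(X, y):
        """Least-squares weights w minimizing ||Xb w - y||^2, via w = pinv(Xb) y."""
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column
        return np.linalg.pinv(Xb) @ y

    def predict_linear(X, w):
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])
        return Xb @ w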

  • LS-SVM - Adding the ridge makes the kernel matrix positive definite - The ridge also performs regularization! - The problem is then equivalent to minimizing a regularized least-squares cost (the slide also gives a heuristic formula for lambda; a minimal kernel-ridge sketch follows below)
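
    A minimal numpy sketch of the LS-SVM idea in its simplest ridge form (without the bias term of the full formulation): build a Gaussian kernel matrix, add the ridge lam*I so the system is positive definite and regularized, solve for the weights, and predict. The sigma and lam values are placeholders, not the heuristic from the slide:

    import numpy as np

    def gaussian_kernel(A, B, sigma):
        """K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def lssvm_fit(X_train, y_train, sigma=1.0, lam=1e-2):
        K = gaussian_kernel(X_train, X_train, sigma)
        # The ridge term lam*I makes K + lam*I positive definite and regularizes the fit
        return np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)

    def lssvm_predict(X_test, X_train, alpha, sigma=1.0):
        return gaussian_kernel(X_test, X_train, sigma) @ alpha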

  • Local Learning in Kernel Space

  • Local Learning in Kernel Space (network view with inputs x1 ... xi ... xM) - The kernel layer gives a similarity score with each training data point - The weights correspond to the dependent variable for the entire training data - The result is a kind of nearest-neighbor weighted prediction score - "Make up" kernels (a similarity-weighted prediction sketch follows below)
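
    One way to read this picture in code: a similarity-weighted ("nearest-neighbor style") prediction, sketched here as a Nadaraya-Watson estimator with a Gaussian similarity. This illustrates the idea only; it is not the exact network on the slide:

    import numpy as np

    def gaussian_similarity(x, X_train, sigma=1.0):
        """Similarity score of one point x with every training data point."""
        d2 = ((X_train - x) ** 2).sum(axis=1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def kernel_weighted_predict(x, X_train, y_train, sigma=1.0):
        """Similarity-weighted average of the training responses."""
        s = gaussian_similarity(x, X_train, sigma)
        return (s @ y_train) / s.sum()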

  • What Does LS-SVM Do? - K-PLS is like a linear method in a nonlinear kernel space - Kernel space is the latent space of support vector machines (SVMs) - How to make LS-SVM work? Select the kernel transformation (usually a Gaussian kernel) and select the regularization parameter

  • What is in a Kernel? A kernel can be considered as a (nonlinear) data transformation - Many different choices for the kernel are possible - Most popular is the Radial Basis Function or Gaussian kernel

    The Gaussian kernel is a symmetric matrix - Entries reflect nonlinear similarities amongst data descriptions - As defined by K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))

  • Data Visualization with the Cardiomag Program - Command: cardiomag patients.txt 402 (data visualization mode; requires the Java runtime environment) - Files: patients.txt, pat1.txt.txt, pat2.txt.txt, vis.txt, vis.txt.txt, pat_ID.jpg, pat_view.jar, wave_val.cat - Shows raw data and wavelet-transformed data

  • Worth its Weight in Gold?

  • Data Mining Applications in DDASSL - QSAR drug design - Microarrays - Breast cancer diagnosis (TransScan)

  • Molecule #BR #C #CL #F #H #I #N #O #P #S #SI BALA IDC IDCBAR IDW IDWBAR K0 K1 K2 K3 KA1 KA2 KA3 NXC3 NXC4 NXCH10 NXCH3 NXCH4 NXCH5 NXCH6 NXCH7 NXCH8 NXCH9 NXP10 NXP2 NXP3 NXP4 NXP5 NXP6 NXP7 NXP8 NXP9 NXPC4 SI TOPOL90 TOPOL91 TOPOL92 TOPOL93 TOPOL94 TOPOL95 TOPOL96 TOPOL97 TOPOL98 TOPOL99 WW X0 X1 X2 XC3 XC4 XCH10 XCH3 XCH4 XCH5 XCH6 XCH7 XCH8 XCH9 XP10 XP3 XP4 XP5 XP6 XP7 XP8 XP9 XPC4 XV0 XV1 XV2 XVC3 XVC4 XVCH10 XVCH3 XVCH4 XVCH5 XVCH6 XVCH7 XVCH8 XVCH9 XVP10 XVP3 XVP4 XVP5 XVP6 XVP7 XVP8 XVP9 XVPC4 S001 S002 S003 S004 S005 S006 S007 S008 S009 S010 S011 S012 S013 S014 S015 S016 S017 S018 S019 S020 S021 S022 S023 S024 S025 S026 S027 S028 S029 S030 S031 S032 S033 S034 S035 S036 S037 S038 S039 S040 S041 S042 S043 S044 S045 S046 S047 S048 S049 S050 S051 S052 S053 S054 S055 S056 S057 S058 S059 S060 S061 S062 S063 S064 S065 S066 S067 S068 S069 S070 S071 S072 S073 S074 S075 S076 S077 S078 S079 S080 S081 S082 S083 S084 S085 S086 S087 S088 S089 S090 S091 S092 S093 S094 S095 S096 S097 S098 S099 S100 S101 S102 S103 S104 S105 S106 S107 S108 S109 S110 S111 S112 S113 S114 S115 S116 S117 S118 S119 S120 S121 S122 S123 S124 S125 S126 S127 S128 S129 S130 S131 S132 S133 S134 S135 S136 S137 S138 S139 S140 S141 S142 S143 S144 S145 S146 S147 S148 S149 S150 S151 S152 S153 S154 S155 S156 S157 S158 S159 S160 S161 S162 S163 S164 S165 S166 S167 S168 S169 S170 S171 S172 S173 S174 S175 S176 S177 S178 S179 S180 S181 S182 S183 S184 S185 S186 S187 S188 S189 S190 S191 S192 S193 S194 S195 S196 S197 S198 S199 S200 S201 S202 S203 S204 S205 S206 S207 S208 AbsBNP1 AbsBNP10 AbsBNP2 AbsBNP3 AbsBNP4 AbsBNP5 AbsBNP6 AbsBNP7 AbsBNP8 AbsBNP9 AbsBNPMax AbsBNPMin AbsDGN1 AbsDGN10 AbsDGN2 AbsDGN3 AbsDGN4 AbsDGN5 AbsDGN6 AbsDGN7 AbsDGN8 AbsDGN9 AbsDGNMax AbsDGNMin AbsDKN1 AbsDKN10 AbsDKN2 AbsDKN3 AbsDKN4 AbsDKN5 AbsDKN6 AbsDKN7 AbsDKN8 AbsDKN9 AbsDKNMax AbsDKNMin AbsDRN1 AbsDRN10 AbsDRN2 AbsDRN3 AbsDRN4 AbsDRN5 AbsDRN6 AbsDRN7 AbsDRN8 AbsDRN9 AbsDRNMax AbsDRNMin AbsEP1 AbsEP10 AbsEP2 AbsEP3 AbsEP4 AbsEP5 AbsEP6 AbsEP7 AbsEP8 AbsEP9 AbsEPMax AbsEPMin AbsFuk1 AbsFuk10 AbsFuk2 AbsFuk3 AbsFuk4 AbsFuk5 AbsFuk6 AbsFuk7 AbsFuk8 AbsFuk9 AbsFukMax AbsFukMin AbsG1 AbsG10 AbsG2 AbsG3 AbsG4 AbsG5 AbsG6 AbsG7 AbsG8 AbsG9 AbsGMax AbsGMin AbsK1 AbsK10 AbsK2 AbsK3 AbsK4 AbsK5 AbsK6 AbsK7 AbsK8 AbsK9 AbsKMax AbsKMin AbsL1 AbsL10 AbsL2 AbsL3 AbsL4 AbsL5 AbsL6 AbsL7 AbsL8 AbsL9 AbsLMax AbsLMin BNP BNP1 BNP10 BNP2 BNP3 BNP4 BNP5 BNP6 BNP7 BNP8 BNP9 BNPAvg BNPMax BNPMin Del(G)NA1 Del(G)NA10 Del(G)NA2 Del(G)NA3 Del(G)NA4 Del(G)NA5 Del(G)NA6 Del(G)NA7 Del(G)NA8 Del(G)NA9 Del(G)NIA Del(G)NMax Del(G)NMin Del(K)IA Del(K)Max Del(K)Min Del(K)NA1 Del(K)NA10 Del(K)NA2 Del(K)NA3 Del(K)NA4 Del(K)NA5 Del(K)NA6 Del(K)NA7 Del(K)NA8 Del(K)NA9 Del(Rho)NA1 Del(Rho)NA10 Del(Rho)NA2 Del(Rho)NA3 Del(Rho)NA4 Del(Rho)NA5 Del(Rho)NA6 Del(Rho)NA7 Del(Rho)NA8 Del(Rho)NA9 Del(Rho)NIA Del(Rho)NMax Del(Rho)NMin EP1 EP10 EP2 EP3 EP4 EP5 EP6 EP7 EP8 EP9 Fuk Fuk1 Fuk10 Fuk2 Fuk3 Fuk4 Fuk5 Fuk6 Fuk7 Fuk8 Fuk9 FukAvg FukMax FukMin Lapl Lapl1 Lapl10 Lapl2 Lapl3 Lapl4 Lapl5 Lapl6 Lapl7 Lapl8 Lapl9 LaplAvg LaplMax LaplMin PIP1 PIP10 PIP11 PIP12 PIP13 PIP14 PIP15 PIP16 PIP17 PIP18 PIP19 PIP2 PIP20 PIP3 PIP4 PIP5 PIP6 PIP7 PIP8 PIP9 PIPAvg PIPMax PIPMin piV SIDel(G)N SIDel(K)N SIDel(Rho)N SIEP SIEPA1 SIEPA10 SIEPA2 SIEPA3 SIEPA4 SIEPA5 SIEPA6 SIEPA7 SIEPA8 SIEPA9 SIEPIA SIEPMax SIEPMin SIG SIGA1 SIGA10 SIGA2 SIGA3 SIGA4 SIGA5 SIGA6 SIGA7 SIGA8 SIGA9 SIGIA sigmanew sigmaNV sigmaPV SIGMax SIGMin SIK SIKA1 SIKA10 SIKA2 SIKA3 SIKA4 SIKA5 SIKA6 SIKA7 SIKA8 SIKA9 SIKIA SIKMax SIKMin sumsigma 
SurfArea Volume CAQSOL CHEM_POT CLOGP CMR CSAREA_A CSAREA_B CSAREA_C DELTAHF DIPOLE ETOT EVDW HARDNESS HBAB HBDA HOMO LENGTH_A LENGTH_B LENGTH_C LUMO MASS MUA MUB MUC NHBA NHBD NUMHB PISUBI QMINUS QPLUS RA RB RC SAAB SAAC SABC SAREA SASAREA SASVOL SHAPE VLOOP1 VLOOP2 VLOOP3 VLOOP4 VLOOP5 VOLUME

    Rows (compound IDs): CO1, CO2, CO3, CO4, CO5, CO6, CO7, CO8, CO9, C10, . . ., C66, T1C11, T2C1, T2C11, T2C13, T2C16, T2C1, T2C9

    Response: LCCKA, the log(10) of the inhibition concentration for the "A" receptor site on Cholecystokinin
  • Electron Density-Derived TAE-Wavelet Descriptors - 1) Surface properties are encoded on the 0.002 e/au3 surface (Breneman, C.M. and Rhem, M. [1997] J. Comp. Chem., Vol. 18 (2), pp. 182-197) - 2) Histograms or wavelet encodings of the surface properties give the TAE property descriptors

  • Validation model: 100x leave-10%-out validation (a sketch follows below)
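
    A scikit-learn sketch of 100x leave-10%-out validation; Ridge is only a stand-in model (the handout's models are neural networks and K-PLS):

    import numpy as np
    from sklearn.model_selection import ShuffleSplit
    from sklearn.linear_model import Ridge
    from sklearn.metrics import r2_score

    def leave_10pct_out(X, y, model=None, n_rounds=100, seed=0):
        """Repeatedly hold out a random 10% of the data, fit on the rest,
        and collect the test-set scores."""
        model = model if model is not None else Ridge()
        scores = []
        splitter = ShuffleSplit(n_splits=n_rounds, test_size=0.1, random_state=seed)
        for tr, te in splitter.split(X):
            model.fit(X[tr], y[tr])
            scores.append(r2_score(y[te], model.predict(X[te])))
        return np.mean(scores), np.std(scores)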

  • Data StripMining Approach for Feature Selection (PLS, K-PLS, SVM, ANN)

  • Kernel PLS (K-PLS) - Introduced by Rosipal and Trejo (Journal of Machine Learning Research, December 2001) - K-PLS gives results almost identical to (but more stable than) SVMs on QSAR data - K-PLS is more transparent - K-PLS allows visualization in SVM space - Computationally efficient with few heuristics - There is no patent on K-PLS - Consider K-PLS a better nonlinear PLS (a simplified sketch follows below)
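
    A simplified sketch of the K-PLS idea: run ordinary linear PLS on a centered Gaussian kernel matrix. This is an illustration only, not Rosipal and Trejo's NIPALS-based algorithm; the kernel width, the number of latent variables, and the centering details are assumptions:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    def gaussian_kernel(A, B, sigma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def kpls_fit_predict(X_tr, y_tr, X_te, n_components=5, sigma=1.0):
        """Fit PLS on the centered training kernel; predict from the test kernel."""
        n = len(X_tr)
        H = np.eye(n) - np.ones((n, n)) / n                # centering matrix
        K = gaussian_kernel(X_tr, X_tr, sigma)
        pls = PLSRegression(n_components=n_components)
        pls.fit(H @ K @ H, y_tr)

        Kt = gaussian_kernel(X_te, X_tr, sigma)
        Kt_c = (Kt - np.ones((len(X_te), n)) @ K / n) @ H  # center test kernel consistently
        return pls.predict(Kt_c).ravel()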

  • Binding affinities to human serum albumin (HSA): log Khsa - Gonzalo Colmenarejo, GlaxoSmithKline (J. Med. Chem. 2001, 44, 4370-4378) - 95 molecules, 250-1500+ descriptors - 84 training, 10 testing (1 left out) - 551 wavelet + PEST + MOE descriptors - Widely different compounds - Acknowledgements: Sean Eakins (Concurrent), N. Sukumar (Rensselaer)

  • [Slide background: long genomic DNA sequence, GATCAATGAGGTGGACACCAGAGGCGGGGACTTG...] WORK IN PROGRESS

  • APPENDIX: Downloading and Installing JAVA and the JAVA Runtime Environment - To be able to make JAVA plots, installation of the JRE (the JAVA Runtime Environment) is required. The current version is the JAVA 2 Standard Edition Runtime Environment 1.4, which provides complete runtime support for JAVA 2 applications. - In order to build a JAVA application you must download the SDK. The JAVA 2 SDK is a development environment for building applications, applets, and components using the JAVA programming language. - The current version of the JRE or JDK for a specific platform can be downloaded from:

    http://java.sun.com/j2se/1.4/download.html

    Make sure you set a path to the bin folder in the autoexec.bat file (or the equivalent for Windows NT/XP or LINUX/UNIX).

  • Performance Indicators - The RPI definitions include r2 and R2 for the training set and q2 and Q2 for the test set - r2 is the squared correlation coefficient for the training set; q2 is 1 minus the squared correlation coefficient for the test set

    R2 is defined as 1 - sum((y_i - yhat_i)^2) / sum((y_i - ybar)^2) on the training set

    Q2 is defined as R2 for the test set - Note (iv): In bootstrap mode q2 and Q2 are usually very close to each other; significant differences between q2 and Q2 often indicate an improper choice for the kernel width, or an error in data scaling/pre-processing (a sketch computing these follows below)
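
    A small numpy helper computing these indicators as described in the slide's wording (the exact RPI conventions may differ in detail):

    import numpy as np

    def performance_indicators(y_true, y_pred):
        """Return the squared correlation coefficient and the explained-variance score."""
        rr = np.corrcoef(y_true, y_pred)[0, 1] ** 2
        R2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)
        return rr, R2

    # Training set: report r2 = rr and R2 directly.
    # Test set: q2 = 1 - rr and Q2 = R2, computed on the test predictions.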

    How do we sample the molecular surface?