DNA Classifications with Self-Organizing Maps (SOMs)
Thanakorn Naenna, Mark J. Embrechts, Robert A. Bress
May 2003, IEEE International Workshop on Soft Computing in Industrial Applications
Presentation Outline
• Introduction to DNA Splice Junctions
• Data Collection
• Introduction to SOMs
• SOM for DNA Splice Junction Classification
• Results
• Conclusions
Human genome in a nutshell
• Human: 23 pairs of chromosomes
• Chromosomes: thousands of genes
• Gene information: exons; "comments": introns
  Splice junctions are like /* comment flags */ in C code
• Exons and introns: made up of codons
• Codons: three bases each
DNA Splice Junctions
• DNA: billions of nucleotides (A, C, G, T)
• Genes: coding sequences (exons), which specify amino acids, often interrupted by non-coding nucleotides (introns)
• <0.1% of human DNA is made up of exons
• 99% of splice junctions have the same motif:
  – Exon to intron: GT
  – Intron to exon: AG

…GTGAAGGTTAA AG ATGTAGAT GT ATTG…
Intron | Splice junction | Exon | Splice junction | Intron
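
The GT/AG motif is necessary but not sufficient: most GT and AG dinucleotides in a sequence are not splice junctions, which is why a classifier is needed at all. As a minimal sketch (the function name `find_candidate_junctions` is hypothetical, not from the deck), candidate sites can be listed like this:

```python
def find_candidate_junctions(seq):
    """Return (position, motif) for every GT or AG dinucleotide in seq."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in ("GT", "AG"):
            hits.append((i, pair))
    return hits


# Every hit is only a candidate; deciding which ones are real
# junctions is the classification problem addressed in this talk.
candidates = find_candidate_junctions("AAGTCC")
```

Each candidate still has to be classified as a true intron/exon boundary or a false positive.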
Data Collection: HTML Browser + Perl scripts
BioBrowser pipeline:
Download HTML → ExtractLinks() → Download HTML data → ExtractData() → TranslateData()
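
The original pipeline was written in Perl and its code is not shown. Purely as an illustration of the link-extraction step, here is a Python stand-in built on the standard library's `html.parser`; the class and function names are assumptions, and the real `ExtractLinks()` may have worked quite differently.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Hypothetical stand-in for the deck's ExtractLinks() step."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

The returned links would then be downloaded in turn and passed to the data-extraction step.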
DNA Splice Junction (Cont.)
• A complete gene is made up of different exons
• Splice junction identification aids in the discovery of new genes
• The dataset used for this study is made up of 1,424 sequences
• Data were created ab initio from GenBank
• Each sequence is 32 nucleotides long, comprising the splice junction and the regions from -15 to +15 nucleotides around it

…TGTAAGG AG ACGAGTT…
Intron | Splice junction | Exon

Left region | Splice junction | Right region | Class
Intron | AG | Exon | A
Exon | GT | Intron | B
Unknown | AG or GT | Unknown | C
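
The 32-nucleotide length follows from 15 flanking nucleotides on each side plus the 2-nucleotide motif (15 + 2 + 15 = 32). A minimal sketch of cutting such a window out of a longer sequence, assuming `pos` is the index of the first motif base (the helper name and signature are hypothetical):

```python
FLANK = 15  # nucleotides kept on each side of the 2-nt junction motif


def junction_window(seq, pos, flank=FLANK):
    """Return the flank + 2 + flank window around the dinucleotide at pos.

    Returns None when the window would run off either end of seq.
    """
    start, end = pos - flank, pos + 2 + flank
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]
```

Candidates too close to either end of the source sequence are simply skipped in this sketch.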
Self-Organizing Maps (SOM) Network
• Unsupervised learning neural network
• Projects high-dimensional input data onto a two-dimensional output map
• Preserves the topology of the input data
• Visualizes structures and clusters of the data
[Figure: SOM network. An input layer with five components (Component 1 to Component 5) is connected to the output layer; output neurons i and c have weight vectors (w_i1, …, w_i5) and (w_c1, …, w_c5).]
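
To make the training idea concrete, here is a small pure-Python SOM sketch. It is not the authors' implementation: the rectangular grid, Gaussian neighborhood, and linear decay schedules are all assumptions, and the parameter names only echo those listed on the demo slide.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda n: sum((w - v) ** 2 for w, v in zip(weights[n], x)))


def train_som(data, rows, cols, num_its=2000, alpha_max=0.9, alpha_min=0.05,
              max_neighborhood=None, seed=0):
    """Train a rows x cols SOM; returns one weight vector per neuron (row-major)."""
    rng = random.Random(seed)
    dim = len(data[0])
    if max_neighborhood is None:
        max_neighborhood = max(rows, cols) / 2
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    for t in range(num_its):
        frac = t / max(1, num_its - 1)
        alpha = alpha_max + (alpha_min - alpha_max) * frac    # assumed linear decay
        radius = max(max_neighborhood * (1 - frac), 0.5)      # shrinking neighborhood
        x = rng.choice(data)
        c = best_matching_unit(weights, x)
        c_row, c_col = divmod(c, cols)
        for n, w in enumerate(weights):
            n_row, n_col = divmod(n, cols)
            grid_d2 = (n_row - c_row) ** 2 + (n_col - c_col) ** 2
            h = math.exp(-grid_d2 / (2 * radius ** 2))        # Gaussian neighborhood
            for k in range(dim):
                w[k] += alpha * h * (x[k] - w[k])             # pull toward the sample
    return weights
```

Because the winner's grid neighbors are pulled toward each sample along with the winner, nearby neurons end up with similar weights, which is what gives the map its topology-preserving property.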
Use of SOM for DNA Splice Junction Classification Model
DNA training set → SOM → U-matrix map (regions A, B, C)
Neuron identification methods:
- Highest frequency class
- Closest neuron
DNA test set → SOM classification map → Classification:
- Class A: intron to exon
- Class B: exon to intron
- Class C: no transition
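
The slide names the two neuron identification methods without defining them. One plausible reading, sketched below as an assumption, is that a neuron takes the most frequent class among the training samples it wins, and a neuron that wins no samples takes the label of the closest labeled neuron in weight space. All function names here are hypothetical.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def bmu(weights, x):
    """Index of the best-matching (closest) neuron for sample x."""
    return min(range(len(weights)), key=lambda n: sq_dist(weights[n], x))


def label_neurons(weights, train_x, train_y):
    """Assign a class label to every neuron of a trained SOM."""
    votes = [Counter() for _ in weights]
    for x, y in zip(train_x, train_y):
        votes[bmu(weights, x)][y] += 1
    # Highest frequency class: majority vote among the samples a neuron wins.
    labels = [v.most_common(1)[0][0] if v else None for v in votes]
    labeled = [n for n, lab in enumerate(labels) if lab is not None]
    # Closest neuron: an empty neuron borrows the label of its nearest labeled neuron.
    for n, lab in enumerate(labels):
        if lab is None:
            nearest = min(labeled, key=lambda m: sq_dist(weights[n], weights[m]))
            labels[n] = labels[nearest]
    return labels


def classify(weights, labels, x):
    """Classify a test sample by the label of its best-matching neuron."""
    return labels[bmu(weights, x)]
```

With every neuron labeled, classifying a test sequence reduces to a single best-matching-unit lookup.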
The U-matrix of the DNA Training Set
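
A U-matrix shows, for each map neuron, how far its weight vector lies from those of its grid neighbors: large values mark cluster boundaries, small values mark cluster interiors. The sketch below assumes a rectangular grid with 4-connected neighbors and row-major weights; the layout used for the figure in the talk is not stated.

```python
import math


def u_matrix(weights, rows, cols):
    """Mean Euclidean distance from each neuron to its 4-connected grid neighbours."""
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    umat = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            here = weights[r * cols + c]
            ds = [dist(here, weights[nr * cols + nc])
                  for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                  if 0 <= nr < rows and 0 <= nc < cols]
            umat[r][c] = sum(ds) / len(ds) if ds else 0.0
    return umat
```

Rendering `umat` as a heat map gives the familiar picture of flat cluster regions separated by ridges.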
SOM Results for DNA Splice Junction Data
Confusion matrix of the 424-sequence DNA test set (rows: actual class; columns: classified to)

DNA sequences | Class A | Class B | Class C | Total
Class A | 102 (93%) | 2 (2%) | 6 (5%) | 110
Class B | 0 (0%) | 90 (91%) | 9 (9%) | 99
Class C | 4 (2%) | 6 (3%) | 205 (95%) | 215
Total | 106 | 98 | 220 | 424

[Figure: the U-matrix of the DNA training set, with regions labeled A, B, and C]
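
The per-class percentages on this slide are row-wise rates (the fraction of each actual class assigned to each predicted class). The short sketch below recomputes them from the counts and also derives the overall accuracy and per-class precision, which the slide does not state; the variable names are illustrative only.

```python
labels = ["A", "B", "C"]

# Rows: actual class; columns: class assigned by the SOM (counts from the slide).
confusion = [
    [102, 2, 6],
    [0, 90, 9],
    [4, 6, 205],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / total

# Recall: correct / row total. Precision: correct / column total.
recall = {lab: confusion[i][i] / sum(confusion[i])
          for i, lab in enumerate(labels)}
precision = {lab: confusion[i][i] / sum(row[i] for row in confusion)
             for i, lab in enumerate(labels)}
```

From these counts, 397 of the 424 test sequences are classified correctly, an overall accuracy of about 93.6%.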
Conclusions
• SOMs are effective for DNA splice junction classification
• SOMs are a powerful visualization tool for high-dimensional data
Demo with Analyze Code
• 800 training data, 324 test data (160 features)
• 96% correct overall classification on test data

Confusion Matrix
      | IE | FALSE | EI
IE    | 98 | 0     | 0
FALSE | 5  | 111   | 3
EI    | 2  | 3     | 102

9 // K
18 // L
6 // max_neighborhood
20000 // num_its
50000 // num_fine_its
0.9 // alpha_max
0.05 // alpha_min
1 // LVQ_flag
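
The slide reports 160 features per sequence but does not say how the nucleotides were encoded. As an illustration only, a one-hot encoding is a common choice: with the four-letter alphabet a 32-nucleotide window gives 128 features, while a five-symbol alphabet (for example with an extra "N" for unknown bases) would give 32 × 5 = 160. Whether that is what the Analyze demo did is a guess.

```python
ALPHABET = "ACGT"


def one_hot(seq, alphabet=ALPHABET):
    """Encode each base as len(alphabet) indicator features.

    Symbols outside the alphabet become an all-zero group.
    """
    features = []
    for base in seq.upper():
        features.extend(1.0 if base == sym else 0.0 for sym in alphabet)
    return features
```

The resulting numeric vectors are the kind of input a SOM can be trained on directly.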
GATCAATGAGGTGGACACCAGAGGCGGGGACTTGTAAATAACACTGGGCTGTAGGAGTGA
TGGGGTTCACCTCTAATTCTAAGATGGCTAGATAATGCATCTTTCAGGGTTGTGCTTCTA
TCTAGAAGGTAGAGCTGTGGTCGTTCAATAAAAGTCCTCAAGAGGTTGGTTAATACGCAT
GTTTAATAGTACAGTATGGTGACTATAGTCAACAATAATTTATTGTACATTTTTAAATAG
CTAGAAGAAAAGCATTGGGAAGTTTCCAACATGAAGAAAAGATAAATGGTCAAGGGAATG
GATATCCTAATTACCCTGATTTGATCATTATGCATTATATACATGAATCAAAATATCACA
CATACCTTCAAACTATGTACAAATATTATATACCAATAAAAAATCATCATCATCATCTCC
ATCATCACCACCCTCCTCCTCATCACCACCAGCATCACCACCATCATCACCACCACCATC
ATCACCACCACCACTGCCATCATCATCACCACCACTGTGCCATCATCATCACCACCACTG
TCATTATCACCACCACCATCATCACCAACACCACTGCCATCGTCATCACCACCACTGTCA
TTATCACCACCACCATCACCAACATCACCACCACCATTATCACCACCATCAACACCACCA
CCCCCATCATCATCATCACTACTACCATCATTACCAGCACCACCACCACTATCACCACCA
CCACCACAATCACCATCACCACTATCATCAACATCATCACTACCACCATCACCAACACCA
CCATCATTATCACCACCACCACCATCACCAACATCACCACCATCATCATCACCACCATCA
CCAAGACCATCATCATCACCATCACCACCAACATCACCACCATCACCAACACCACCATCA
CCACCACCACCACCATCATCACCACCACCACCATCATCATCACCACCACCGCCATCATCA
TCGCCACCACCATGACCACCACCATCACAACCATCACCACCATCACAACCACCATCATCA
CTATCGCTATCACCACCATCACCATTACCACCACCATTACTACAACCATGACCATCACCA
CCATCACCACCACCATCACAACGATCACCATCACAGCCACCATCATCACCACCACCACCA
CCACCATCACCATCAAACCATCGGCATTATTATTTTTTTAGAATTTTGTTGGGATTCAGT
ATCTGCCAAGATACCCATTCTTAAAACATGAAAAAGCAGCTGACCCTCCTGTGGCCCCCT
TTTTGGGCAGTCATTGCAGGACCTCATCCCCAAGCAGCAGCTCTGGTGGCATACAGGCAA
CCCACCACCAAGGTAGAGGGTAATTGAGCAGAAAAGCCACTTCCTCCAGCAGTTCCCTGT
THE END
AAAAGCATTGGGAA
GGTTC
CCGTTGAAC
GGTCAGGTTAGACTA
EXTRACTING KNOWLEDGE
NUCLEOTIDES
A–T
C–G
• DNA is double-stranded
• A & T are complements
• C & G are complements
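
Because of Watson-Crick base pairing (A with T, C with G), either strand of a DNA molecule determines the other. A minimal sketch of computing the opposite strand, read in its own 5'→3' direction (the function name is illustrative):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand: complement each base, then reverse."""
    return seq.upper().translate(COMPLEMENT)[::-1]
```

Characters other than A, C, G, and T pass through unchanged in this sketch.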
AMINO ACIDS
• Sequences of three nucleotides, called codons, code for amino acids
• There are 20 different amino acids
• Exons are the parts of DNA that code for amino acids
• Each amino acid is encoded by between 1 and 6 different codons
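
The "1 to 6 codons per amino acid" figure can be checked directly against the standard genetic code. The sketch below builds the 64-entry codon table from the conventional compact string (bases in T, C, A, G order, `*` marking stop codons) and counts how many codons map to each amino acid; `translate` is an illustrative helper, not part of the original work.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate a DNA string codon by codon, starting at position 0."""
    seq = seq.upper()
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


# Number of codons per amino acid, ignoring the stop signal.
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
```

The count yields 20 amino acids, with methionine and tryptophan encoded by a single codon each and leucine, serine, and arginine by six each.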
PROTEINS
• Proteins are made up of sequences of amino acids
• Generally responsible for some biological function
• May have complicated folding patterns that are difficult to predict
GENES
• An estimated 30,000 to 100,000 genes exist in the human genome
• Most genes have not yet been discovered
• Genes are made up of nucleotide sequences that code for amino acids
• Genes are interrupted by non-coding regions of DNA called "introns"
CHROMOSOMES
READING FRAMES
…ACG TAGAT…
• Reading frames may be difficult to determine
• Reading frames may be shifted by splice junctions
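
A sequence can be read as codons starting at offset 0, 1, or 2, and each choice gives a completely different codon series. A small sketch, using the fragment from this slide as input (the function name is illustrative):

```python
def reading_frames(seq):
    """Split seq into whole codons for each of the three forward reading frames."""
    return [[seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
            for offset in range(3)]


frames = reading_frames("ACGTAGAT")
# frames[0] == ["ACG", "TAG"]   (TAG is a stop codon in this frame)
# frames[1] == ["CGT", "AGA"]
# frames[2] == ["GTA", "GAT"]
```

Only the first frame contains a stop codon here, which shows how much the interpretation depends on the chosen frame.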
GENE STRUCTURE
Start Codon (ATG)
Exon sequence (amino acid string)
Intron sequence (junk DNA)
Stop Codon (3 possible)
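
For an intron-free stretch of DNA, this structure can be found with a simple scan: locate an ATG, then read codons in frame until one of the three stop codons appears. The sketch below does only that and deliberately ignores introns, which is exactly the complication that makes splice junction detection necessary in real genes. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq):
    """Return (start, end) index pairs of ATG...stop open reading frames.

    end is exclusive and includes the stop codon. Introns are ignored.
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Read in frame from the codon after ATG until the first stop codon.
        for i in range(start + 3, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                orfs.append((start, i + 3))
                break
    return orfs
```

An ATG with no in-frame stop codon before the end of the sequence is not reported.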
SPLICE JUNCTIONS
• Segments of DNA that join coding and non-coding regions