David Goldberg
CS 1950 Directed Study
RNA Sequence
Up Intron   Exon   Down Intron
CCCACTCCATGATTACAC
CATGCCGTAGCTCATGCC
GATTACACATGCCGTAG
GCCACGTCTTTTGCTCTTTGCAGGATTACATCACTGGAAACTTTAGCCACGTAAACTTTA
Pattern 1: ACATCAC   Pattern 2: ACGT
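To make the slide concrete, here is a minimal sketch that locates both patterns in the long example sequence. It is written in Python rather than the Perl the project uses, and the helper name `find_all` is hypothetical, not part of the original program.

```python
def find_all(sequence, pattern):
    """Return the 0-based start of every occurrence, overlapping ones included."""
    positions = []
    start = sequence.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one base later so overlapping occurrences are not skipped.
        start = sequence.find(pattern, start + 1)
    return positions


# Sequence and patterns copied from the slide.
sequence = "GCCACGTCTTTTGCTCTTTGCAGGATTACATCACTGGAAACTTTAGCCACGTAAACTTTA"
pattern1_hits = find_all(sequence, "ACATCAC")
pattern2_hits = find_all(sequence, "ACGT")
print("Pattern 1 starts at:", pattern1_hits)
print("Pattern 2 starts at:", pattern2_hits)
```

Pattern 1 occurs once in this sequence and Pattern 2 occurs twice, once on each side of Pattern 1.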
Desired Upgrades
Possible Problems
• Programming in Perl
• Extensive use of regular expressions
• Trouble figuring out exactly what needs to be done
• Don't know if what we want done can be done
Old Program:
• Command line arguments only (2 patterns)
• Cannot use Y, R, or N
• Only checks Human RNA for patterns
• Has a static search length
• Result file displays Human RNA id, Mouse RNA id, and last 75 characters
• Only searches the down intron
New Program:
• Prompts user for inputs:
  – Path of database, with a default
  – 2 patterns
  – Minimum and maximum distance between patterns
  – Searches from either the 3' splice site (beginning) or the 5' splice site (end)
  – Length from beginning or end to search
  – Which part to search (down intron, exon, up intron)
• Will find matches in either the Human RNA, Mouse RNA, or both
• Result file displays Human RNA id, Mouse RNA id, sequence searched, 1st pattern found, sequence in between the 1st pattern and 2nd pattern, and 2nd pattern
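The search described for the New Program can be sketched roughly as follows. This is Python rather than the project's Perl, and it rests on several assumptions: Y, R, and N are read as the standard IUPAC codes (pyrimidine, purine, any base) and are assumed to be allowed in the new patterns; the function names and the `from_end` flag are hypothetical; and mapping "beginning" or "end" onto a splice site is left to the caller.

```python
import re

# Standard IUPAC ambiguity codes: R = purine, Y = pyrimidine, N = any base.
IUPAC = {"R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}


def to_regex(pattern):
    """Translate a pattern such as 'ACGY' into a regular expression fragment."""
    return "".join(IUPAC.get(base, re.escape(base)) for base in pattern.upper())


def search_window(region, pattern1, pattern2, min_gap, max_gap, length, from_end):
    """Search `length` bases taken from the start or the end of `region`.

    Returns (window, None) when nothing matches, otherwise
    (window, (pattern 1 text, text in between, pattern 2 text)).
    """
    if from_end:
        window = region[max(len(region) - length, 0):]
    else:
        window = region[:length]
    # The lazy quantifier prefers the shortest gap that still lets pattern 2 match.
    regex = re.compile(
        "(" + to_regex(pattern1) + ")"
        + "(.{" + str(min_gap) + "," + str(max_gap) + "}?)"
        + "(" + to_regex(pattern2) + ")"
    )
    match = regex.search(window)
    if match is None:
        return window, None
    return window, match.groups()


# Sequence taken from the sample New Program results file (Pattern1=ACG, Pattern2=TT).
window, hit = search_window("TTTTATTCATACGCTTACAG", "ACG", "TT", 0, 15, 20, True)
print(window, hit if hit else "Not Found")
```

Returning the three pieces separately mirrors the result file layout (1st pattern, sequence in between, 2nd pattern); a `None` hit would be written as "Not Found".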
Old Program Results File
Pattern1=ACG Pattern2=TT
humanID mouseID
ENSG00000124721_61 ENSMUSG00000033826_64 CENA GTAAGTTTTTATTTTTATTTATATCTACGTAGAAAGAGTTCCTTATTTAAAGGTGCTTAGTTTGCCTTCTCTGAT
ENSG00000113569_8 ENSMUSG00000022142_8 CENA GTAAGTAGAAAACAATAAATTTGGCAAGTACAACTAATTTCTAACACATTGTTCCCTCAACGTTTTCTTCAGAAA
ENSG00000105323_14 ENSMUSG00000040725_13 CENA GTGAGAGAATGAGTGTGTGTTTGTATGTAGTGATCGCACGTGTGCTTTTGAACCTGAGCAAGTTAGGTGGAGGCG
...
New Program Results File
Pattern1=ACG Pattern2=TT Search=up SITE=3'
humanID mouseID
ENSG00000134690_4 SE CTACAACGTTCTTTTTAAAG ACG TT
ENSMUSG00000028873_3 SE Not Found

ENSMUSG00000026954_6 CENA TTTTATTCATACGCTTACAG ACG C TT
ENSG00000115145_5 CENA Not Found

ENSG00000124721_67 CENA CCACGTCTTCTTCTTTTCAG ACG TC TT
ENSMUSG00000033826_70 CENA Not Found

ENSG00000052126_20 CENA ACGTTTTCTAATATTCCCAG ACG TT
ENSMUSG00000030231_11 CENA Not Found

ENSG00000138468_2 SE CACGTCTTTGGTTTTTGTAG ACG TC TT
ENSMUSG00000022591_2 SE TACGTCTTTCATTTTTGTAG ACG TC TT

ENSG00000151376_4 CENA ACGTGTTTTATTTCTTTTAG ACG TG TT
ENSMUSG00000030621_4 CENA Not Found
...
Exon, Intron Program
• Wanted a program that searches the end of the down intron and the beginning of the exon.
• The first pattern would be in the intron.
• The second pattern would be in the exon.
• Exons usually start with a GT pattern, so if the exon starts with GT, that part should be ignored in the pattern matching; if the GT is not present, the program should still try to match the 2 patterns.
RNA Sequence
Up Intron   Exon   Down Intron
CCCACTCCATGATTACAC
CATGCCGTAGCTCATGCC
GATTACACATGCCGTAG
ACTCCATGATTACAC GATTACACATG
Pattern 1: GATT   Pattern 2: ACAT
Exon, Intron Program
• Prompts user for inputs:
  – Path of database, with a default
  – 2 patterns
  – Minimum distance between patterns
• Will find matches in either the Human RNA, Mouse RNA, or both
• Result file displays:
  – Human RNA id and Mouse RNA id
  – Small part of the down intron before the 1st pattern
  – 1st pattern found
  – Sequence between the 1st pattern and the end of the down intron
  – The GT sequence, if it was at the start of the exon
  – The beginning of the exon up to the 2nd pattern
  – 2nd pattern
  – Small part of the exon after the 2nd pattern
  – Length of the sequence between the 1st pattern and the end of the intron
  – Length of the sequence between the start of the exon and the 2nd pattern
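The junction search and its result fields might look something like the sketch below, written in Python rather than the project's Perl. Everything named here is hypothetical (`search_junction`, the dictionary keys, `max_space`). The sketch also assumes that the space limit applies to each gap separately and that the occurrence of the 1st pattern closest to the end of the intron is the one wanted; the slides do not settle either point. The example treats the first fragment on the RNA Sequence slides as the intron and the second as the exon.

```python
def search_junction(intron, exon, pattern1, pattern2, max_space):
    """Look for pattern1 near the end of `intron` and pattern2 near the start of `exon`.

    Returns a dict of result fields, or None when either pattern is missing
    or either gap is longer than `max_space` (an assumed interpretation).
    """
    # The last occurrence of pattern1 leaves the shortest gap to the intron end.
    p1_start = intron.rfind(pattern1)
    if p1_start == -1:
        return None
    intron_gap = intron[p1_start + len(pattern1):]

    # A leading GT is set aside and not used for matching; if it is absent,
    # matching simply starts at the first base of the exon.
    gt = "GT" if exon.startswith("GT") else ""
    body = exon[len(gt):]
    p2_start = body.find(pattern2)
    if p2_start == -1:
        return None
    exon_gap = body[:p2_start]

    if len(intron_gap) > max_space or len(exon_gap) > max_space:
        return None
    return {
        "before": intron[:p1_start],
        "pattern1": pattern1,
        "intron_gap": intron_gap,
        "gt": gt,
        "exon_gap": exon_gap,
        "pattern2": pattern2,
        "after": body[p2_start + len(pattern2):],
        "intron_gap_length": len(intron_gap),
        "exon_gap_length": len(exon_gap),
    }


# Fragments and patterns from the RNA Sequence slides (Pattern 1: GATT, Pattern 2: ACAT).
with_gt = search_junction("CCCACTCCATGATTACAC", "GTATTACACATGCCGTA", "GATT", "ACAT", 15)
without_gt = search_junction("CCCACTCCATGATTACAC", "GATTACACATGCCGTAG", "GATT", "ACAT", 15)
print(with_gt)
print(without_gt)
```

Keeping the GT in its own field means the reported exon length counts only the bases between the GT and the 2nd pattern, which is how the two trailing numbers in the sample results appear to be computed.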
RNA Sequence
Up Intron   Exon   Down Intron
CCCACTCCATGATTACAC
CATGCCGTAGCTCATGCC
GTATTACACATGCCGTA
Pattern 1: GATT   Pattern 2: ACAT
ACTCCATGATTACAC GTATTACACATGCCGTA
54
Exon, Intron Program Results File
Pattern1=ACTG Pattern2=TTAC Max Space:15
humanID mouseID
ENSG00000163872_13 ENSMUSG00000041215_12 CENA TGTAACATCT ACTG TCAAG GT AACATTC TTAC TGCGTT 5 7
ENSG00000135390_3 ENSMUSG00000010371_3 CENA GGAGAT ACTG ACAGATGAG GT ACC TTAC AGTGGAGTTG 9 3
ENSG00000103876_8 ENSMUSG00000030630_8 CENA CTTATGAACG ACTG GAGTG GT AA TTAC TGGAGCTCTGC 5 2
ENSG00000156253_3 ENSMUSG00000041079_3 SE TGCCTGAAATT ACTG TCAG GT ACG TTAC AGAAGCTCTG 4 3
ENSG00000151490_18 ENSMUSG00000030223_18 CENA AGAAGAGGAA ACTG ACAAA GT AAGTTTTTC TTAC TATG 5 9
...