HKU Computer Centre: Introduction to EMBOSS. Christine Ho, [email protected]
HKU
ComputerCentre
Web page of EMBOSS
The EMBOSS programs are available at http://bioinfo.hku.hk/EMBOSS/
The files required for this lecture are available at http://bioinfo.hku.hk/tutorial/
Users must apply for a BIOINFO account to use the tools on the web and off-line, and to download the databases.
BIOINFO accounts are open to the public free of charge, and usage of BIOINFO is restricted to academic and research purposes only.
How to apply for a BIOINFO account:
HKU members: submit the HKUESD application form (Cfe-139)
Non-HKU members: submit the application form at http://www.hku.hk/ccoffice/forms/cf139.pdf
Questions and comments: [email protected]
What is EMBOSS?
EMBOSS (The European Molecular Biology Open Software Suite) is a free, open-source software package that provides a comprehensive set of sequence analysis programs developed specially for the needs of the molecular biology user community.
Within EMBOSS you will find around 100 programs (applications).
More information about EMBOSS can be found at http://www.uk.embnet.org/Software/EMBOSS/
Main Programs in EMBOSS
Retrieve sequences from database
Sequence alignment
Nucleic acid gene finding and translation
Protein secondary structure prediction
Rapid database searching with sequence patterns
Protein motif identification, including domain analysis
Nucleotide sequence pattern analysis, for example to identify CpG islands or repeats
Codon usage analysis for small genomes
Rapid identification of sequence patterns in large-scale sequence sets
Presentation tools for publication
Starting EMBOSS
There are two ways to start EMBOSS:
Command line, after logging in to bioinfo.hku.hk
Web interface (EMBOSS-GUI)
Command line of EMBOSS
Inside HKU campus: telnet bioinfo.hku.hk
Outside HKU campus:
Windows machine: use PuTTY, see http://bioinfo.hku.hk FAQ Q13
Linux or UNIX machine: ssh <username>@bioinfo.hku.hk
Web interface of EMBOSS
Directly access the web page at http://bioinfo.hku.hk/EMBOSS/
Or browse the BIOSUPPORT homepage at http://bioinfo.hku.hk/ and select the "Tools" option.
Web interface of EMBOSS Click on the link EMBOSS - GUI
Programs in EMBOSS
Parameters in EMBOSS
Input can be:
A Uniform Sequence Address (USA) in one of these formats:
database:entry_name or database:accession_number (e.g. embl:xlrhodop or embl:L07770)
database:wildcard (e.g. sw:opsd_a*)
filename
filename:entry
format::filename
@list
Sequence data pasted into the text area.
Programs in EMBOSS
Output will be:
A textual and/or graphical representation of the data.
The output can be saved as a text file or, in some cases, as an image file in PNG or PS format.
EMBOSS online help The documentation for EMBOSS is available
at http://bioinfo.hku.hk/emboss/
Difference between GCG and EMBOSS
File formats supported
GCG: GCG, MSF, RSF, FastA, BLAST (other file formats must be converted using a program, e.g. FromFastA, FromEMBL, FromPIR, etc.)
EMBOSS: ABI trace file, ACeDB, Clustal ALN (multiple alignment), EMBL, FASTA, GENBANK, NBRF (PIR), PHYLIP interleaved multiple alignment, SWISSPROT, plain text, etc.
Number of sequences in one file
GCG: one file can hold only one sequence.
EMBOSS: one file can hold multiple sequences.
Third-party packages included
GCG: FASTA, BLAST
EMBOSS: FASTA, BLAST and assembly programs are not included. They must be run separately.
Upper limit of sequence size
GCG: 35K
EMBOSS: 2G
Replacement of GCG programs
Exchanging sequences between packages
In GCG -> In EMBOSS
getseq -> newseq
fromfasta, tofasta, fromembl, toembl, from…, to… -> (any program that reads/writes sequences), seqret
Replacement of GCG programs
Sequence editing, manipulation and display
In GCG -> In EMBOSS
fetch -> seqret
seqed -> no complete solution yet
seqed, command delete -> cutseq
seqed, command insert -> pasteseq
lineup -> no good solution yet
assemble -> union
shuffle -> shuffleseq
reverse -> revseq
chopup -> not needed, as EMBOSS reads 'any' format
publish -> showseq, prettyseq
Replacement of GCG programs
Sequence comparison and alignment
In GCG -> In EMBOSS
compare+dotplot (default (window stringency)) -> dotmatcher
compare+dotplot (word=n) -> dottup
gap -> needle, stretcher (for long sequences)
bestfit -> water, matcher (for long sequences)
pileup, clustal -> emma (=CLUSTAL)
pretty -> cons, showalign
Translation
In GCG -> In EMBOSS
translate -> transeq
Replacement of GCG programs
Patterns and gene finding
In GCG -> In EMBOSS
findpatterns -> fuzznuc, fuzztran, fuzzpro (NB: these use PROSITE syntax, not GCG syntax, to define the pattern)
motifs -> patmatmotifs (NB: ps_scan also searches PROSITE profiles)
codonpreference -> syco, wobble
Replacement of GCG programs
Phylogeny
In GCG -> In EMBOSS
distances+growtree -> ednadist or eprotdist + eneighbor
Mapping
In GCG -> In EMBOSS
map -> remap, restrict
map with option "Find translationally silent potential restriction sites" -> silent
map with option 3' or 5' overhang -> restover
mapsort -> restrict
mapsort+plasmidmap -> cirdna (only a partial solution: the input file with tick positions must be created "manually")
Replacement of GCG programs
Protein analysis
In GCG -> In EMBOSS
pepplot, peptidestructure+plotstructure -> garnier, pepinfo, octanol, pepwindow
Primer selection
In GCG -> In EMBOSS
prime -> eprimer3 (=Primer3)
primepair, melttemp -> no good solution yet
Replacement of GCG programs
Keyword-based databank searching
In GCG -> In EMBOSS
names -> whichdb
indexsearch -> indexsearch
stringsearch (mode A) -> textsearch
stringsearch (mode B) -> no good solution yet, but advantageously replaceable by indexsearch
Running EMBOSS programs
EMBOSS programs are run by typing their names at the Unix prompt, or by using an interface.
The EMBOSS command syntax follows normal Unix command conventions.
programname -help: get some help on the options.
programname -opt: make the program prompt you for common options.
tfm programname: get the full help on a program.
Login bioinfo
Log in to bioinfo with 'telnet bioinfo.hku.hk'.
If you are using the temp account, please create a directory named after your username at hkusua:
bioinfo% mkdir <username>
E.g. bioinfo% mkdir chantaiman
Change to the directory you created:
bioinfo% cd <username>
E.g. bioinfo% cd chantaiman
wossname
It is easy to forget the name of a program.
To find EMBOSS programs, use wossname
wossname finds programs by looking for keywords in the description or the name of the program.
wossname
Type wossname at the Unix % prompt:
bioinfo % wossname
It displays a one-line description and prompts you for information:
Finds programs by keywords in their one-line documentation
Keyword to search for: restrict
SEARCH FOR 'RESTRICT'
recode Remove restriction sites but maintain the same translation
remap Display a sequence with restriction cut sites, translation etc
.....
Optional parameters
To get prompted for all the optional parameters, type the following:
bioinfo % wossname -opt
Finds programs by keywords in their one-line documentation
Keyword to search for: protein
Output program details to a file [stdout]: myfile
Format the output for HTML [N]:
String to form the first half of an HTML link:
String to form the second half of an HTML link:
Output only the group names [N]:
Output an alphabetic list of programs [N]:
Use the expanded group name [N]:
help
bioinfo % wossname -help
Mandatory qualifiers:
[-search] string Enter a word or words here.
Optional qualifiers (* if not always prompted):
-outfile outfile this program will write the program names
Advanced qualifiers:
-[no]emboss bool EMBOSS program documentation will be searched.
Mandatory: required; these are often parameters (shown in '[]').
Optional: use -opt to be prompted for these.
Advanced: things that are not often used.
Writing to the screen
Note that the default output file for wossname was stdout (standard output).
You can use this name whenever you are prompted for an output file.
This is a 'magic' file name: it displays the output on the screen, not in a file.
Working with sequences
EMBOSS reads sequences from files or databases.
It automatically recognizes the input sequence format.
You can easily specify many output formats.
Getting sequences from the databases
Database single entry (ID): database:entry
For example embl:hsfau
Wildcarded entries (Query): database:hs*
For example sw:fos_*
All entries: database:*
Most databases will support all 3 methods; some may not.
showdb
bioinfo% showdb
Displays information on the currently available databases
# Name Type ID Qry All Comment
# ==== ==== == === === =======
domo P OK OK OK DOMO sequences
enspep P OK OK OK ENSEMBL PEP sequences
gp P OK OK OK GENPEPT sequences
gpnew P OK OK OK New GENPEPT sequences
kabatp P OK OK OK KABAT Protein sequences
nrl P OK OK OK NRL_3d
pdb P OK OK OK PDB sequences
pir P OK OK OK PIR using NBRF access for 4 files
rem P OK OK OK REMTREMBL sequences
seqret
Reads in a sequence, and writes it out.
bioinfo % seqret
Reads and writes (returns) a sequence
Input sequence: embl:xlrhodop
Output sequence [xlrhodop.fasta]:
bioinfo % more xlrhodop.fasta
>XLRHODOP L07770 Xenopus laevis rhodopsin
ggtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaaaaa
agaaacacagaaggcattctttctatacaagaaaggactttatagagctgctaccatgaa
cggaac . .
seqret from the command line
Give seqret all of its data on the command line, so that it does not need to prompt for anything else:
bioinfo % seqret embl:xlrhodop -outseq xlrhodop.fasta
The '-outseq' qualifier can be abbreviated to '-out'. Any abbreviation must be unique.
Even shorter, leave out the qualifier:
bioinfo % seqret embl:xlrhodop xlrhodop.fasta
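The rule that an abbreviated qualifier must be unique can be sketched as a prefix lookup. This is only an illustration of the idea, not EMBOSS's own code; the function name resolve_qualifier and the qualifier list are hypothetical.

```python
def resolve_qualifier(abbrev, known):
    """Return the one qualifier that abbrev stands for, or raise ValueError."""
    if abbrev in known:
        return abbrev  # an exact name always wins
    matches = [q for q in known if q.startswith(abbrev)]
    if len(matches) != 1:
        raise ValueError(f"'{abbrev}' is unknown or ambiguous: {matches}")
    return matches[0]


# Hypothetical qualifier list, for illustration only.
QUALIFIERS = ["outseq", "osformat", "ossingle", "sequence"]

print(resolve_qualifier("out", QUALIFIERS))
```

With this list, "out" resolves to "outseq", while "os" is rejected because it could mean either "osformat" or "ossingle".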
Changing output formats (reformatting)
seqret can reformat sequences by specifying the output format:
bioinfo % seqret embl:xlrhodop xlrhodop.gcg -osformat gcg
bioinfo % more xlrhodop.gcg
!!NA_SEQUENCE 1.0
Xenopus laevis rhodopsin mRNA, complete cds.
XLRHODOP Length: 1684 Type: N Check: 9453 ..
1 ggtagaacag cttcagttgg gatcacaggc ttctagggat cctttgggca
51 aaaaagaaac acagaaggca ttctttctat acaagaaagg actttataga
. .
Multiple sequences, single file
You can use seqret to retrieve multiple sequences into a file:
bioinfo% seqret "sw:opsd_a*" opsd_a.seqs
This retrieves all the sequences whose identifiers start with "opsd_a" into a file called opsd_a.seqs.
Multiple sequences, many files
If you wish to write one sequence per file, use:
bioinfo % seqret "sw:opsd_a*" -ossingle
The output filenames will be based on the sequence entry names.
The program seqretsplit will split an existing multiple sequence file into many files.
Asterisk on the command line
You can't use an unquoted '*' on the UNIX command line, because UNIX tries to match it to filenames.
Quote it, either with quotes or a backslash:
"embl:*"
embl:\*
For example:
bioinfo % seqret "embl:hsf*" hsf.seq
EMBOSS web interface
On the left, you can choose the program to run. You can also see all the programs sorted alphabetically, instead of sorted by group, by clicking on the link.
Getting help in EMBOSS
Help on the program is available by clicking on the question mark.
Input to EMBOSS
If you know the entry name or accession number, enter the sequence in Uniform Sequence Address (USA) format, e.g. embl:xlrhodop
Input to EMBOSS If you have your own sequence file,
upload the sequence by clicking the browse button.
Input to EMBOSS
You can also copy and paste your own sequence into the text area.
seqret web interface
E.g. seqret - retrieving a single sequence
Input:
USA path: embl:xlrhodop
Output file format: GCG 9.x/10.x
Output: the sequence retrieved in GCG format
seqret
seqret - retrieving multiple sequences
Input: sw:ops2_*
Output file format: Pearson FASTA
Output: multiple sequences whose identifiers start with ops2_.
Save the file as ops2.fasta by right-clicking on the link.
coderet
Extracts CDS, mRNA and translations from feature tables.
If any sequences are in other entries of that database, they are automatically fetched and incorporated correctly into the final sequence.
Input: embl:X03487
coderet output
dottup
dottup - Comparison between 2 sequences using dot-plots.
Input:
1st sequence: embl:xl23808 (Xenopus laevis rhodopsin gene)
2nd sequence: embl:xlrhodop (Xenopus laevis rhodopsin cDNA from complement of mRNA)
Output: a dotplot in PNG format, in which diagonal lines represent areas where the two sequences align well.
The image can be saved to your computer.
dottup
The 5 diagonal lines represent areas where the two sequences align well.
Since this is aligning genomic DNA and cDNA, the five diagonals represent the five exons of the gene.
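How diagonals arise in a word-match dot-plot can be sketched in a few lines of Python. This is a minimal illustration of exact word matching, not dottup's implementation; the function name word_matches and the default word size of 4 are assumptions made for the example.

```python
def word_matches(seq_a, seq_b, wordsize=4):
    """Return (i, j) pairs where the word at seq_a[i:] equals the word at seq_b[j:]."""
    # Index every word of seq_b by its start position.
    index = {}
    for j in range(len(seq_b) - wordsize + 1):
        index.setdefault(seq_b[j:j + wordsize], []).append(j)
    # Look up every word of seq_a; each hit is one dot of the plot.
    dots = []
    for i in range(len(seq_a) - wordsize + 1):
        for j in index.get(seq_a[i:i + wordsize], []):
            dots.append((i, j))
    return dots


print(word_matches("ACGTAC", "TTACGT"))
```

Consecutive dots with the same value of i - j lie on one diagonal, which is how a run of matching words (such as an exon shared by a gene and its cDNA) shows up as a line.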
Pairwise Sequence Alignment
An alignment is an arrangement of two sequences which shows where the two sequences are similar, and where they differ.
There is no unique, precise, or universally applicable notion of similarity.
Global Alignment
A global alignment is one that compares the two sequences over their entire lengths, and is appropriate for comparing sequences that are expected to share similarity over the whole length.
The alignment maximizes regions of similarity and minimizes gaps using the scoring matrices and gap parameters provided to the program.
needle
Function: Needleman-Wunsch global alignment
Description: This program uses the Needleman-Wunsch global alignment algorithm to find the optimum alignment (including gaps) of two sequences when considering their entire length.
The computation is rigorous. It can be time-consuming to run if the sequences are long.
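The dynamic programming behind a global alignment can be sketched as follows. This is a simplified illustration and not needle itself: it assumes a flat match/mismatch score and a single linear gap penalty (all three values are arbitrary choices for the example), whereas needle uses a substitution matrix with separate gap opening and extension penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b under a linear gap penalty."""
    # Row 0: aligning a prefix of b against nothing costs one gap per residue.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a prefix of a against nothing
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    # The last cell covers both sequences over their entire length.
    return prev[-1]


print(needleman_wunsch_score("ACGT", "AGT"))
```

The table has len(a) x len(b) cells, which is why the computation becomes slow for long sequences.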
Input sequence for needle
needle
needle - Needleman-Wunsch global alignment
Input: 1st sequence: embl:xlrhodop, 2nd sequence: embl:xl23808
Output: global alignment showing the 5 aligned regions.
Local alignment
Local alignment searches for regions of local similarity and need not include the entire length of the sequences.
Local alignment methods are very useful for scanning databases or other circumstances when you wish to find matches between small regions of sequences, for example, between protein domains.
water
Function: Smith-Waterman local alignment.
Description: water uses the Smith-Waterman algorithm (modified for speed enhancements) to calculate the local alignment.
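The basic Smith-Waterman recurrence can be sketched as below. It is a plain textbook version under assumed scores (match 2, mismatch -1, linear gap -2) and does not include the speed modifications or the matrix and affine gap penalties that water uses.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b under a linear gap penalty."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below 0, so an alignment may start anywhere.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)  # and it may end anywhere
        prev = curr
    return best


print(smith_waterman_score("TTTACGTTT", "GGGACGGGG"))
```

Only the shared "ACG" contributes to the score here; the unrelated flanking residues are left out, which is what distinguishes a local from a global alignment.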
water
water - Smith-Waterman local alignment.
Input: 1st sequence: embl:xlrhodop, 2nd sequence: embl:xl23808
Output: local alignment showing the 5 aligned regions.
Multiple Sequence Analysis
Multiple sequence alignments are used:
To find patterns to characterize protein families.
To detect or demonstrate homology between a new sequence and existing families of sequences.
To help predict the secondary and tertiary structures of new sequences.
As an essential prelude to molecular evolutionary analysis.
emma
Function: Multiple alignment program - interface to the ClustalW program
Description: emma calculates the multiple alignment of nucleic acid or protein sequences according to the method of Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). This is an interface to the ClustalW distribution.
Upload file to emma
Input: the output from seqret (ops2.fasta), containing all SwissProt sequences whose identifiers begin with ops2_ (USA sw:ops2_*).
Click on the browse button to upload the file ops2.fasta.
Input sequence to emma ops2.fasta
emma emma – interface to ClustalW program
Output: multiple alignment saved as file ops2.aln.
prettyplot
prettyplot - displays aligned sequences, with colouring and boxing.
Input: the output from the program emma (ops2.aln).
Output: a graphic display of the aligned sequences, with identical residues in red and similar residues in green.
prophecy
Function: Creates matrices/profiles from multiple alignments
Description: This creates a profile matrix file from a nucleic acid or a protein sequence alignment.
The profile matrix file can then be used by the program profit or prophet.
prophecy Input:
Sequence: output from program emma ops2.aln
Select type: Gribskov
prophecy Output: A profile to be saved as ops2.prophecy.
This profile allows a new sequence to be aligned optimally to a family of similar sequences in the program prophet.
prophet
prophet - Gapped alignment for profiles
Input:
Input sequence: the file xlrhodop.pep, the output of transeq for the 110-1171 region of embl:xlrhodop.
Profile or matrix file: ops2.prophecy
Output file: ops2.prophet
Output: the gapped alignment to the profile. The vertical bars (|) represent residues that are identical between the ops2 consensus and our rhodopsin, while the colons (:) represent conservative substitutions. Aligning members of a family can reveal conserved regions that may be important for structure and/or function.
prophet output
plotorf
plotorf - plots potential open reading frames
Input sequence: embl:xlrhodop
Output: graphical output showing the potential open reading frames in all six frames. The longest protein is in the second frame, which is the correct open reading frame.
getorf
getorf - Finds and extracts open reading frames (ORFs)
Input:
Sequence: embl:xlrhodop
Type of sequence to output: Nucleic sequence between START and STOP codons
Output: textual information about the region and the sequence of that region.
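Finding regions between START and STOP codons can be sketched as a scan over the three forward reading frames. This is a simplified illustration, not getorf: it assumes ATG as the only start codon, ignores the reverse strand, and reports 1-based coordinates that include the stop codon, which may differ from how getorf reports positions. The function name find_orfs and the min_codons parameter are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for ATG...STOP regions on the forward strand."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    # 1-based, inclusive, stop codon included
                    orfs.append((start + 1, i + 3, frame + 1))
                start = None
    return orfs


print(find_orfs("CCATGAAATGACC"))
```

A coordinate pair found this way is the kind of information that is then passed to a translation step as the region to translate.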
transeq
transeq - Translate nucleic acid sequences
Input:
Sequence: embl:xlrhodop
Regions to translate: 110-1171 (from the getorf output)
Output: translated sequence of the given region. Save the file as xlrhodop.pep
Exercise 1 Q1
Align HER2 _ERB2_HUMAN and UNKNOWN_AAL39899.1 with needle and water. What is the main difference between the two types of alignment in these two cases? (The files HER2-fasta.prt and ALL39899_1.prt are at http://bioinfo.hku.hk/tutorial/.)
Repeat the Smith-Waterman alignment of HER2-fasta.prt and ALL39899_1.prt with different parameters. What happens if the gap penalties are changed to 30 and 2 instead of the defaults of 10 and 0.5?
BLOSUM62 is the default matrix. What happens to the local alignment (using the program water) when other matrices are used, e.g. EPAM10?
Exercise 1 Q2
Type gb:A7120FTSZ in the text box and run seqret. Run entret with the same sequence USA and examine the entry. What is the difference between the two entries?
Exercise 1 Q3
With the program infoseq, display information on all sequences whose name starts with ‘10’ in the SwissProt database. (hint: the sequence is sw:10*, choose the information you want to display by changing to ‘yes’)
Exercise 1 answer (A1) Needle output
Exercise 1 answer (A1) Water output
Exercise 1 answer (A1) Water output with gap opening penalty of 30 and gap extension penalty of 2.
Exercise 1 answer (A1) Water output with matrix of EPAM10
Exercise 1 answer (A1)
The global alignment (needle) requires the whole sequences to be aligned. The % identity and % similarity are much lower than in the local alignment (water).
If the gap penalties are changed to 30 and 2, no gaps appear in the alignment.
If EPAM10 is used, the score and alignment length drop. Since PAM is derived from global alignments, it gives a worse result for the local alignment program water. EPAM10 is more suitable for very similar proteins with no more than 10% evolutionary divergence.
Exercise 1 answer (A1)
Amino acid substitution matrices
PAM (percent accepted mutation) lists the likelihood of change from one amino acid to another in homologous sequences during evolution.
One PAM is a unit of evolutionary divergence in which 1% of the amino acids have been changed.
Some amino acid substitutions occur more readily than others, probably because they do not have a great effect on the structure and function of a protein.
Exercise 1 answer (A1)
Amino acid substitution matrices (cont'd)
BLOSUM: matrix values are based on a large set of ~2000 conserved amino acid patterns called blocks. Blocks come from a database of protein sequences representing more than 500 families of related proteins.
PAM is derived from global alignments of proteins, while BLOSUM comes from alignments of shorter sequences.
The matrix built from blocks with no more than x% similarity is called BLOSUM x.
Exercise 1 answer (A1)
PAM100 ==> Blosum90
PAM120 ==> Blosum80
PAM160 ==> Blosum62
PAM200 ==> Blosum52
PAM250 ==> Blosum45
The Blosum matrices are best for detecting local alignments.
The Blosum62 matrix is the best for detecting the majority of weak protein similarities.
The Blosum45 matrix is the best for detecting long and weak alignments.
Exercise 1 answer (A1) If the BLOSUM62 matrix is compared to PAM160
then it is found that the BLOSUM matrix is less tolerant of substitutions to or from hydrophilic amino acids, while more tolerant of hydrophobic changes and of cysteine and tryptophan mismatches.
Exercise 1 answer (A2): seqret output
Exercise 1 answer (A2): entret output
Exercise 1 answer (A2)
You will see the sequence for the Anabaena 7120 ftsZ and gsh-III genes.
EMBOSS is also capable of extracting more information than just the sequence from a database entry. The program entret will return the entire entry as a text file.
Exercise 1 answer (A3) Output
garnier
garnier - Predicts protein secondary structure using the Garnier-Osguthorpe-Robson (GOR) method.
Secondary structure prediction is notoriously difficult to do accurately. The GOR I algorithm is one of the first semi-successful methods.
The Garnier method is not regarded as the most accurate prediction method, but it is simple to calculate on most workstations.
Input: the translated sequence (xlrhodop.pep), produced by transeq from the 110-1171 region of embl:xlrhodop.
Output: predicted protein secondary structure.
garnier output
pepinfo
pepinfo - Plots simple amino acid properties in parallel.
Input sequence: the translated sequence (xlrhodop.pep), produced by transeq from the 110-1171 region of embl:xlrhodop.
Output: a textual and graphical representation of amino acid properties (size, polarity, aromaticity, charge, etc.). Hydrophobicity profiles are useful for locating turns, potential antigenic peptides and transmembrane helices.
pepinfo Showing the residues distribution
pepinfo
Hydrophobicity profiles are useful for locating turns, potential antigenic peptides and transmembrane helices.
Positive score -> a hydrophobic region.
Negative score -> a hydrophilic region.
The profile shows seven highly hydrophobic regions.
Use the program tmap to investigate further.
patmatmotifs
patmatmotifs - searches a PROSITE motif database with a protein sequence. It can identify which known protein family (if any) the new sequence belongs to.
PROSITE currently contains patterns and profiles specific for more than a thousand protein families or domains.
PROSITE patterns: biologically significant amino acid patterns can be summarized in the form of regular expressions.
PROSITE profiles: techniques based on weight matrices allow the detection of protein families and functional/structural domains with extreme sequence divergence.
patmatmotifs
Input sequence: the file xlrhodop.pep, the output of transeq for the 110-1171 region of embl:xlrhodop.
Output: a textual representation showing where the sequence matches a motif.
pscan
pscan - Scans proteins using PRINTS
PRINTS is a database of diagnostic protein signatures, or fingerprints.
Fingerprints are groups of conserved motifs or elements that together form a diagnostic signature for particular protein families.
An uncharacterised sequence matching all motifs or elements can then be readily diagnosed as a true match to a particular family fingerprint.
Input sequence: the file xlrhodop.pep, the output of transeq for the 110-1171 region of embl:xlrhodop.
pscan
Output: a textual representation showing where short sequences match the PRINTS database, which defines functional protein families.
fuzznuc
fuzznuc uses PROSITE-style patterns to search nucleotide sequences.
Letter code for patterns:
[ACG] stands for A or C or G.
{AG} stands for any nucleotide except A and G.
N(3) corresponds to N-N-N; N(2,4) corresponds to N-N or N-N-N or N-N-N-N.
Example: [CG](5)TG{A}N(1,5)C
Input:
Sequence: embl:hhtetra
Pattern: AAGCTT
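The pattern notation above maps closely onto regular expressions, which the following sketch makes concrete. It is an illustration only, not how fuzznuc is implemented: it handles just the constructs listed above, treats N as "any nucleotide", and does not support mismatches or other ambiguity codes. The function name prosite_to_regex is hypothetical.

```python
import re


def prosite_to_regex(pattern, wildcard="N"):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":  # [ACG]: any one of the listed symbols
            j = pattern.index("]", i)
            out.append(pattern[i:j + 1])
            i = j + 1
        elif ch == "{":  # {AG}: any symbol except the listed ones
            j = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:j] + "]")
            i = j + 1
        elif ch == "(":  # (3) or (2,4): repeat count for the previous element
            j = pattern.index(")", i)
            out.append("{" + pattern[i + 1:j] + "}")
            i = j + 1
        elif ch == wildcard:  # N: any symbol
            out.append(".")
            i += 1
        elif ch == "-":  # optional separator between elements
            i += 1
        else:  # a literal symbol
            out.append(re.escape(ch))
            i += 1
    return "".join(out)


regex = prosite_to_regex("[CG](5)TG{A}N(1,5)C")
print(regex)
print([m.start() + 1 for m in re.finditer(prosite_to_regex("AAGCTT"), "GGAAGCTTCC")])
```

For protein patterns the same function can be called with wildcard="X".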
fuzznuc output
Exercise 2 Q1
Use tmap to display membrane-spanning regions with the input sequence xlrhodop.pep (translated with the program transeq from the 110-1171 region of embl:xlrhodop). Does the result agree with pepinfo?
Exercise 2 Q2
Use fuzzpro to search sequence: CREAp_m.txt pattern: CXXXXC (the file CREAp_m.txt is from http://bioinfo.hku.hk/tutorial/)
Exercise 2 Q3
Use patmatmotifs to find patterns in the SwissProt sequences fos_human or fos_rat, and use these patterns with fuzzpro to search the fos genes of other organisms. (Hint: use sw:fos_human for the input; other organisms: bovin, chick, mouse, sheep.)
Exercise 2 Q4
Sometimes it is better to run the program fuzznuc on the command line, because more parameters can be given.
In the BIOINFO terminal, type the following (you must put the command on one line at the UNIX prompt):
bioinfo% fuzznuc -sequence=embl:hhtetra -pattern=AAGCTT -mismatch=1 -complement -outf=outf.out
How is the result different from the previous run in the web interface?
Exercise 2 answer (A1)
Bars are displayed in the plot above the regions predicted as most likely to form transmembrane regions.
There may be seven transmembrane helices in this protein.
The result agrees with pepinfo.
Exercise 2 answer (A2)
The symbol 'x' is used for a position where any amino acid is accepted.
Therefore, the pattern CXXXXC matches the sequences CQFPGC and CMFPGC.
Exercise 2 answer (A2) Patmatmotifs output using sw:FOS_HUMAN
Exercise 2 answer (A3)
When run with patmatmotifs, the sequences sw:FOS_HUMAN and sw:FOS_RAT return the same motifs: AMIDATION, LEUCINE_ZIPPER and BZIP_BASIC.
When fuzzpro is run with one of the patterns, the start and end positions agree with patmatmotifs.
Exercise 2 answer (A3)
Fuzzpro output with pattern "GRAQSIGRRGKVEQ" and sequence sw:fos_human
Exercise 2 answer (A4)
On the command line you can specify the number of mismatches as an input parameter.
The results with 1 mismatch can now be shown.
cpgplot
cpgplot - Plots the CpG-rich areas.
CpG refers to a C nucleotide immediately followed by a G. The 'p' in 'CpG' refers to the phosphate group linking the two bases.
By default, this program defines a CpG island as a region where, over an average of 10 windows, the calculated % composition is over 50%, the calculated Obs/Exp (i.e. Observed/Expected) ratio is over 0.6, and the conditions hold for a minimum of 200 bases.
These conditions can be modified by setting the values of the appropriate parameters.
cpgplot
The Observed number of CpG patterns in a window is simply the count of the number of times a 'C' is found followed immediately by a 'G'.
The Expected frequency of CpG's in a window is calculated as the number of 'C's in the window multiplied by the number of 'G's in the window, divided by the window length.
Expected = (number of C's * number of G's) / window length
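The Observed and Expected counts defined above can be computed for a single window as in the sketch below. It only covers the per-window arithmetic; the averaging over 10 windows and the 200-base minimum length used by cpgplot are not reproduced, and the function name cpg_obs_exp is hypothetical. The % composition is taken here to mean the percentage of C plus G, which is an assumption.

```python
def cpg_obs_exp(window):
    """Return (observed, expected, obs/exp ratio, %C+G) for one sequence window."""
    window = window.upper()
    length = len(window)
    observed = window.count("CG")  # a 'C' immediately followed by a 'G'
    c_count = window.count("C")
    g_count = window.count("G")
    # Expected = (number of C's * number of G's) / window length
    expected = c_count * g_count / length if length else 0.0
    ratio = observed / expected if expected else 0.0
    percent_cg = 100.0 * (c_count + g_count) / length if length else 0.0
    return observed, expected, ratio, percent_cg


print(cpg_obs_exp("CGCGATAT"))
```

For the window CGCGATAT there are 2 observed CpGs, 2 C's and 2 G's in 8 bases, so Expected = 2 * 2 / 8 = 0.5 and Obs/Exp = 4.0.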
cpgplot
Input: embl:rnu68037
cpgplot output
cusp
CUSP reads one or more coding sequences (CDS sequence only) and calculates a codon frequency table.
It is important to use a codon frequency table that is appropriate for the species that your protein comes from.
Input: Seq: embl:paamir Codon usage table: Default (Ehum.cut)
cusp
Output:
Fract: the fraction of this codon among all codons that code for the same amino acid.
/1000: the number of codons per 1000 bases.
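A codon frequency table of this kind can be sketched as follows. This is an illustration rather than cusp's code: it assumes the standard genetic code, reads the sequence in frame from its first base, skips codons containing anything other than A, C, G or T, and normalises the per-thousand value by the number of codons counted, which is an assumption about how that column is scaled. The function name codon_usage is hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_usage(cds):
    """Return {codon: {aa, count, fract, per1000}} for an in-frame coding sequence."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(c for c in codons if c in CODE)
    total = sum(counts.values())
    # Total number of codons observed for each amino acid.
    per_aa = Counter()
    for codon, n in counts.items():
        per_aa[CODE[codon]] += n
    return {
        codon: {
            "aa": CODE[codon],
            "count": n,
            "fract": n / per_aa[CODE[codon]],  # share among synonymous codons
            "per1000": 1000.0 * n / total,     # assumed: per 1000 codons counted
        }
        for codon, n in counts.items()
    }


table = codon_usage("ATGGCTGCTGCCTAA")
print(table["GCT"])
```

In this toy sequence alanine is encoded twice by GCT and once by GCC, so GCT has a fraction of 2/3.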
cusp
Running the program on the command line allows you to specify the sequence begin and end positions:
bioinfo% cusp -sbeg 135 -send 1292
Create a codon usage table
Input sequence(s): embl:paamir
Output file [paamir.cusp]:
cusp
bioinfo% more paamir.cusp
hmoment
hmoment plots or writes out the hydrophobic moment.
The hydrophobic moment is the hydrophobicity of a peptide measured for a specified angle of rotation per residue.
Assumption: the angle of rotation (bonds of the backbone and amino acid side-chains) per residue in alpha helices is 100 degrees. The angle of rotation per residue in beta sheets is 160 degrees.
Input:
Sequence: sw:hbb_human
Produce graph: yes
Plot two graphs: yes
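A hydrophobic moment for one peptide window can be sketched with the usual sine/cosine sum over residues. The details here are assumptions rather than a description of hmoment: the Kyte-Doolittle hydrophobicity scale is used for illustration (hmoment may use a different scale), the result is divided by the window length, and the function name hydrophobic_moment is hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values (assumed scale, for illustration only).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(peptide, angle_deg=100.0):
    """Mean hydrophobic moment of a peptide for a given rotation angle per residue."""
    delta = math.radians(angle_deg)
    # Each residue contributes a vector of length H_n pointing at angle n * delta.
    sin_sum = sum(KD[aa] * math.sin(n * delta) for n, aa in enumerate(peptide, start=1))
    cos_sum = sum(KD[aa] * math.cos(n * delta) for n, aa in enumerate(peptide, start=1))
    return math.hypot(sin_sum, cos_sum) / len(peptide)


window = "LKKLLKKLLKKL"  # hypothetical test peptide
print(hydrophobic_moment(window, 100.0))  # alpha-helix angle
print(hydrophobic_moment(window, 160.0))  # beta-sheet angle
```

Evaluating the same window at 100 and at 160 degrees gives the two values that correspond to the alpha helix and beta sheet assumptions above.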
hmoment
Output: two graphs, one for the alpha helix moment and one for the beta sheet moment.
End of lecture
Thank you!