HKU Computer Centre
Introduction to EMBOSS
Christine Ho, chrisho@cc.hku.hk

Post on 22-Dec-2015

Page 1: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk

Web page of EMBOSS

The EMBOSS programs are available at http://bioinfo.hku.hk/EMBOSS/

The files required for this lecture are available at http://bioinfo.hku.hk/tutorial/

Users must apply for a BIOINFO account to use the tools on the web and off-line, and to download the databases.

Registration for a BIOINFO account is free and open to the public; use of BIOINFO is restricted to academic and research purposes.

How to apply for a BIOINFO account:

HKU members: submit the HKUESD application form (Cfe-139)

Non-HKU members: submit the application form at http://www.hku.hk/ccoffice/forms/cf139.pdf

Questions and comments: [email protected]

What is EMBOSS?

EMBOSS (the European Molecular Biology Open Software Suite) is a free, open-source software package that provides a comprehensive set of sequence analysis programs developed for the needs of the molecular biology user community.

Within EMBOSS you will find around 100 programs (applications).

More information about EMBOSS can be found at http://www.uk.embnet.org/Software/EMBOSS/

Main Programs in EMBOSS

Retrieve sequences from database
Sequence alignment
Nucleic gene finding and translation
Protein secondary structure prediction
Rapid database searching with sequence patterns
Protein motif identification, including domain analysis
Nucleotide sequence pattern analysis, for example to identify CpG islands or repeats
Codon usage analysis for small genomes
Rapid identification of sequence patterns in large-scale sequence sets
Presentation tools for publication

Starting EMBOSS

There are three ways to start EMBOSS:
Command line, after logging in to bioinfo.hku.hk
Web interface (EMBOSS-GUI)

Command line of EMBOSS

Inside HKU campus:
    telnet bioinfo.hku.hk

Outside HKU campus:
    Windows machine: use PuTTY, see http://bioinfo.hku.hk FAQ Q13
    Linux or UNIX machine: ssh <username>@bioinfo.hku.hk

Web interface of EMBOSS

Directly access the web page at http://bioinfo.hku.hk/EMBOSS/

Or browse the BIOSUPPORT homepage at http://bioinfo.hku.hk/ and select the “Tools” option.

Web interface of EMBOSS

Click on the link EMBOSS-GUI.

Programs in EMBOSS: parameters

Input can be:

A Uniform Sequence Address (USA) in one of these formats:
    database
    database:entry_name or database:accession_number (e.g. embl:xlrhodop or embl:L07770)
    database:wildcard (e.g. sw:opsd_a*)
    filename
    filename:entry
    format::filename
    @list

Sequence data pasted into the text area.
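To make the USA formats listed above concrete, here is a minimal Python sketch that splits a USA string into its parts. It is not how EMBOSS itself parses addresses: the `USA` class, the `parse_usa` function and the idea of telling databases from files by a set of known database names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class USA:
    kind: str                     # "list", "database" or "file"
    source: str                   # database name, file name or list file name
    entry: Optional[str] = None   # entry name, accession number or wildcard
    fmt: Optional[str] = None     # explicit format taken from "format::filename"


def parse_usa(usa: str, known_databases: set) -> USA:
    """Split a USA string into its components (illustrative only)."""
    if usa.startswith("@"):
        # "@list": a file that itself contains a list of USAs
        return USA("list", usa[1:])
    fmt = None
    if "::" in usa:
        # "format::filename": an explicit format prefix
        fmt, usa = usa.split("::", 1)
    source, sep, entry = usa.partition(":")
    entry = entry if sep else None
    # Assumption: a source is a database only if its name is in the known set.
    is_db = fmt is None and source.lower() in known_databases
    return USA("database" if is_db else "file", source, entry, fmt)


databases = {"embl", "sw"}
print(parse_usa("embl:xlrhodop", databases))
print(parse_usa("sw:opsd_a*", databases))
print(parse_usa("fasta::myseq.txt", databases))
print(parse_usa("@mylist", databases))
```

The file names `myseq.txt` and `mylist` are placeholders; only the database names and entries come from the slides.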

Programs in EMBOSS

Output will be:

A textual and/or graphical representation of the data.

The output can be saved as a text file or, in some cases, as an image file in PNG or PS format.

EMBOSS online help

The documentation for EMBOSS is available at http://bioinfo.hku.hk/emboss/

Difference between GCG and EMBOSS

File formats supported
    GCG: GCG, MSF, RSF, FastA, BLAST. Other file formats must be converted using a program (e.g. FromFastA, FromEMBL, FromPIR, etc.)
    EMBOSS: ABI trace file, ACeDB, Clustal ALN (multiple alignment), EMBL, FASTA, GENBANK, NBRF (PIR), PHYLIP interleaved multiple alignment, SWISSPROT, plain text, etc.

Number of sequences in one file
    GCG: One file can hold only one sequence.
    EMBOSS: One file can hold multiple sequences.

Third-party packages included
    GCG: FASTA, BLAST
    EMBOSS: FASTA, BLAST and assembly programs are not included. They must be run separately.

Upper limit of sequence size
    GCG: 35K
    EMBOSS: 2G

Replacement of GCG programs

Exchanging sequences between packages

In GCG -> In EMBOSS

getseq -> newseq
fromfasta, tofasta, fromembl, toembl, from…, to… -> seqret (or any program that reads/writes sequences)

Replacement of GCG programs

Sequence editing, manipulation and display

In GCG -> In EMBOSS

fetch -> seqret
seqed -> no complete solution yet
    command delete -> cutseq
    command insert -> pasteseq
lineup -> no good solution yet
assemble -> union
shuffle -> shuffleseq
reverse -> revseq
chopup -> not needed, as EMBOSS reads ‘any’ format
publish -> showseq, prettyseq

Replacement of GCG programs

Sequence comparison and alignment

In GCG -> In EMBOSS

compare + dotplot (default: window/stringency) -> dotmatcher
compare + dotplot (word=n) -> dottup
gap -> needle, stretcher (for long sequences)
bestfit -> water, matcher (for long sequences)
pileup, clustal -> emma (=CLUSTAL)
pretty -> cons, showalign

Translation

In GCG -> In EMBOSS

translate -> transeq

Replacement of GCG programs

Patterns and gene finding

In GCG -> In EMBOSS

findpatterns -> fuzznuc, fuzztran, fuzzpro
    NB: these use PROSITE syntax (not GCG syntax) to define the pattern
motifs -> patmatmotifs
    NB: ps_scan also searches PROSITE profiles
codonpreference -> syco, wobble

Replacement of GCG programs

Phylogeny

In GCG -> In EMBOSS

distances + growtree -> ednadist or eprotdist + eneighbor

Mapping

In GCG -> In EMBOSS

map -> remap, restrict
map with option “Find translationally silent potential restriction sites” -> silent
map with option 3’ or 5’ overhang -> restover
mapsort -> restrict
mapsort + plasmidmap -> cirdna (only a partial solution: the input file with tick positions must be created “manually”)

Replacement of GCG programs

Protein analysis

In GCG -> In EMBOSS

pepplot, peptidestructure + plotstructure -> garnier, pepinfo, octanol, pepwindow

Primer selection

In GCG -> In EMBOSS

prime -> eprimer3 (=Primer3)
primepair, melttemp -> no good solution yet

Replacement of GCG programs

Keyword-based databank searching

In GCG -> In EMBOSS

names -> whichdb
indexsearch -> indexsearch
stringsearch (mode A) -> textsearch
stringsearch (mode B) -> no good solution yet, but advantageously replaceable by indexsearch

Running EMBOSS programs

EMBOSS programs are run by typing their names at the Unix prompt, or by using an interface.

The EMBOSS command syntax follows normal Unix command conventions.

programname -help
    to get some help on the options.

programname -opt
    to make the program prompt you for common options.

tfm programname
    to get the full help on a program.

Login to bioinfo

Log in to bioinfo with ‘telnet bioinfo.hku.hk’.

If you are using the temp account, please create a directory named after your username at hkusua:
    bioinfo% mkdir <username>
    e.g. bioinfo% mkdir chantaiman

Change to the directory you created:
    bioinfo% cd <username>
    e.g. bioinfo% cd chantaiman

wossname

It is easy to forget the name of a program.

To find EMBOSS programs, use wossname

wossname finds programs by looking for keywords in the description or the name of the program.

wossname

Type wossname at the Unix % prompt:

    bioinfo % wossname

It displays a one-line description and prompts you for information:

    Finds programs by keywords in their one-line documentation
    Keyword to search for: restrict
    SEARCH FOR 'RESTRICT'
    recode    Remove restriction sites but maintain the same translation
    remap     Display a sequence with restriction cut sites, translation etc
    .....

Optional parameters

To get prompted for all the optional parameters, type the following:

    bioinfo % wossname -opt
    Finds programs by keywords in their one-line documentation
    Keyword to search for: protein
    Output program details to a file [stdout]: myfile
    Format the output for HTML [N]:
    String to form the first half of an HTML link:
    String to form the second half of an HTML link:
    Output only the group names [N]:
    Output an alphabetic list of programs [N]:
    Use the expanded group name [N]:

help

    bioinfo % wossname -help
       Mandatory qualifiers:
      [-search]      string   Enter a word or words here.
       Optional qualifiers (* if not always prompted):
       -outfile      outfile  this program will write the program names
       Advanced qualifiers:
       -[no]emboss   bool     EMBOSS program documentation will be searched.

Mandatory: required; these are often parameters (shown in ‘[]’).
Optional: use -opt to be prompted for these.
Advanced: things that are not often used!

Writing to the screen

Note that the default output file for wossname was stdout (standard output).

Use this whenever you are prompted for an output file.

This is a ‘magic’ file name: it displays the output on the screen instead of writing it to a file.

Working with sequences

EMBOSS reads sequences from files or databases.

It automatically recognizes the input sequence format.

You can easily specify many output formats.

Getting sequences from the databases

Database single entry (ID)
    database:entry
    For example embl:hsfau

Wildcarded entries (Query)
    database:hs*
    For example sw:fos_*

All entries
    database:*

Most databases support all three methods; some may not.

showdb

bioinfo% showdb
Displays information on the currently available databases
# Name    Type  ID  Qry  All  Comment
# ====    ====  ==  ===  ===  =======
domo      P     OK  OK   OK   DOMO sequences
enspep    P     OK  OK   OK   ENSEMBL PEP sequences
gp        P     OK  OK   OK   GENPEPT sequences
gpnew     P     OK  OK   OK   New GENPEPT sequences
kabatp    P     OK  OK   OK   KABAT Protein sequences
nrl       P     OK  OK   OK   NRL_3d
pdb       P     OK  OK   OK   PDB sequences
pir       P     OK  OK   OK   PIR using NBRF access for 4 files
rem       P     OK  OK   OK   REMTREMBL sequences

seqret

Reads in a sequence, and writes it out.

    bioinfo % seqret
    Reads and writes (returns) a sequence
    Input sequence: embl:xlrhodop
    Output sequence [xlrhodop.fasta]:

    bioinfo % more xlrhodop.fasta
    >XLRHODOP L07770 Xenopus laevis rhodopsin
    ggtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaaaaa
    agaaacacagaaggcattctttctatacaagaaaggactttatagagctgctaccatgaa
    cggaac . .

seqret from the command line

Give seqret all of its data on the command-line.

It doesn’t need to prompt for anything else.

bioinfo % seqret embl:xlrhodop -outseq xlrhodop.fasta

The ‘-outseq’ qualifier can be abbreviated to ‘-out’. Any abbreviation must be unique.

Even shorter, leave out the qualifier:

bioinfo % seqret embl:xlrhodop xlrhodop.fasta
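The rule that an abbreviation must be unique can be sketched in a few lines of Python. This only illustrates unique-prefix matching and is not EMBOSS's own command-line parser; `resolve_qualifier` is a hypothetical helper, and the qualifier list mixes names seen in this lecture (`outseq`, `osformat`, `ossingle`) with an assumed `sequence` qualifier.

```python
def resolve_qualifier(abbrev, qualifiers):
    """Expand an abbreviated qualifier to its full name if the prefix is unique."""
    name = abbrev.lstrip("-")
    if name in qualifiers:
        return name  # an exact match always wins
    matches = [q for q in qualifiers if q.startswith(name)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown qualifier: -{name}")
    raise ValueError(f"ambiguous qualifier -{name}: could be {sorted(matches)}")


# Illustrative qualifier list; "sequence" is an assumption.
QUALIFIERS = ["sequence", "outseq", "osformat", "ossingle"]

print(resolve_qualifier("-out", QUALIFIERS))   # unique prefix of "outseq"
print(resolve_qualifier("-osf", QUALIFIERS))   # unique prefix of "osformat"
try:
    resolve_qualifier("-os", QUALIFIERS)       # matches two qualifiers
except ValueError as err:
    print(err)
```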

Changing output formats (reformatting)

seqret can reformat sequences if you specify the output format:

    bioinfo % seqret embl:xlrhodop xlrhodop.gcg -osformat gcg
    bioinfo % more xlrhodop.gcg
    !!NA_SEQUENCE 1.0
    Xenopus laevis rhodopsin mRNA, complete cds.
    XLRHODOP  Length: 1684  Type: N  Check: 9453  ..
           1  ggtagaacag cttcagttgg gatcacaggc ttctagggat cctttgggca
          51  aaaaagaaac acagaaggca ttctttctat acaagaaagg actttataga
    . .

Multiple sequences, single files

You can use seqret to retrieve multiple sequences into a file:

    bioinfo% seqret "sw:opsd_a*" opsd_a.seqs

This retrieves all the sequences whose identifiers start with "opsd_a" into a file called opsd_a.seqs.
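As a rough picture of what a wildcard query selects, the Python sketch below matches a pattern against a short list of entry names. The list is a hand-written illustration, not the contents of the SwissProt database on bioinfo, and `select_entries` is a hypothetical helper that assumes case-insensitive matching.

```python
from fnmatch import fnmatchcase


def select_entries(entry_names, pattern):
    """Return the entry names that match a wildcard pattern, ignoring case."""
    pattern = pattern.lower()
    return [name for name in entry_names if fnmatchcase(name.lower(), pattern)]


# Illustrative entry names only.
entries = ["OPSD_ABYKO", "OPSD_ALLMI", "OPSD_BOVIN", "OPS2_DROME", "FOS_HUMAN"]

print(select_entries(entries, "opsd_a*"))
print(select_entries(entries, "fos_*"))
print(len(select_entries(entries, "*")))
```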

Multiple sequences, many files

If you wish to write one sequence per file, use:

    bioinfo % seqret "sw:opsd_a*" -ossingle

The output filenames will be based on the sequence entry names.

The program seqretsplit will split an existing multiple sequence file into many files.

Asterisk on the command line

You can't use an unquoted ‘*’ on the UNIX command line, because UNIX tries to match it to filenames.

Quote it, either with quotes or with a backslash:
    "embl:*"
    embl:\*

For example:
    bioinfo % seqret "embl:hsf*" hsf.seq

EMBOSS web interface

On the left, you can choose the program to run. By clicking on the link, you can also list all the programs alphabetically instead of by group.

Getting help in EMBOSS

Help on a program is available by clicking on the question mark.

Input to EMBOSS

If you know the entry name or accession number, enter the sequence in Uniform Sequence Address (USA) format, e.g. embl:xlrhodop

Input to EMBOSS

If you have your own sequence file, upload it by clicking the browse button.

Input to EMBOSS

You can also copy and paste your own sequence into the text area.

seqret web interface

E.g. seqret - retrieving a single sequence

Input:
    USA path: embl:xlrhodop
    Output file format: GCG 9.x/10.x

Output:
    The sequence retrieved in GCG format

seqret

seqret - retrieving multiple sequences

Input: sw:ops2_*
Output file format: Pearson FASTA
Output: multiple sequences whose identifiers start with sw:ops2_. Save the file as ops2.fasta by right-clicking on the link.

coderet

Extracts CDS, mRNA and translations from feature tables. If any sequences are in other entries of that database, they are automatically fetched and incorporated correctly into the final sequence.

Input: embl:X03487

coderet: output

dottup

dottup - comparison between two sequences using dot-plots.

Input:
    First sequence: embl:xl23808 (Xenopus laevis rhodopsin gene)
    Second sequence: embl:xlrhodop (Xenopus laevis rhodopsin cDNA from complement of mRNA)

Output:
    A dotplot in PNG format, in which diagonal lines represent areas where the two sequences align well.
    The image can be saved to your computer.
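The idea behind a word-based dot plot can be sketched in Python: every position pair where the two sequences share an identical word becomes a dot, and runs of dots on one diagonal show up as lines. This is a simplified illustration with toy sequences, not dottup's implementation; `word_matches` and the word size of 4 are assumptions.

```python
def word_matches(seq_a, seq_b, wordsize=4):
    """Return (i, j) pairs where seq_a[i:i+wordsize] equals seq_b[j:j+wordsize]."""
    # Index every word of seq_b by its start positions.
    index = {}
    for j in range(len(seq_b) - wordsize + 1):
        index.setdefault(seq_b[j:j + wordsize], []).append(j)
    # Look up every word of seq_a in that index.
    dots = []
    for i in range(len(seq_a) - wordsize + 1):
        for j in index.get(seq_a[i:i + wordsize], []):
            dots.append((i, j))
    return dots


dots = word_matches("ACGTACGT", "TTACGTAA", wordsize=4)
print(dots)
# Dots with the same value of i - j lie on the same diagonal.
print(sorted({i - j for i, j in dots}))
```

With a genomic sequence on one axis and its cDNA on the other, each exon would produce its own diagonal, which is what the dottup output for the rhodopsin gene shows.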

dottup

The five diagonal lines represent areas where the two sequences align well.

Since this aligns genomic DNA with cDNA, the five diagonals represent the five exons of the gene.

Pairwise Sequence Alignment

An alignment is an arrangement of two sequences which shows where the two sequences are similar, and where they differ.

There is no unique, precise, or universally applicable notion of similarity.

Global Alignment

A global alignment is one that compares the two sequences over their entire lengths, and is appropriate for comparing sequences that are expected to share similarity over the whole length.

The alignment maximizes regions of similarity and minimizes gaps using the scoring matrices and gap parameters provided to the program.

needle

Function
    Needleman-Wunsch global alignment

Description
    This program uses the Needleman-Wunsch global alignment algorithm to find the optimum alignment (including gaps) of two sequences when considering their entire length.
    The computation is rigorous.
    It can be time consuming to run if the sequences are long.
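A minimal Python sketch of Needleman-Wunsch dynamic programming shows why the computation is rigorous but slow for long sequences: every cell of a table whose size is the product of the two sequence lengths is filled. The sketch assumes a simple match/mismatch score and a linear gap penalty; needle itself works with a substitution matrix and separate gap opening and extension penalties, so this is an illustration rather than a re-implementation.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with linear gap penalty; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a aligned against leading gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b aligned against leading gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


total, row_a, row_b = needleman_wunsch("GATTACA", "GATACA")
print(total)
print(row_a)
print(row_b)
```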

Input sequence for needle

needle

needle - Needleman-Wunsch global alignment

Input:
    First sequence: embl:xlrhodop
    Second sequence: embl:xl23808
Output: global alignment showing the five aligned regions.

Local alignment

Local alignment searches for regions of local similarity and need not include the entire length of the sequences.

Local alignment methods are very useful for scanning databases or other circumstances when you wish to find matches between small regions of sequences, for example, between protein domains.

water

Function
    Smith-Waterman local alignment.

Description
    water uses the Smith-Waterman algorithm (modified for speed enhancements) to calculate the local alignment.
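For comparison with global alignment, here is a minimal Python sketch of the basic Smith-Waterman recurrence. The two differences are that no cell score may drop below zero and that the traceback starts at the highest-scoring cell instead of the corner. It uses an assumed match/mismatch score and linear gap penalty, and does not include the speed modifications mentioned above.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment with linear gap penalty; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                          # start a new local alignment
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero score is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTTTACGTACGTTTTT", "GGGACGTACGGGG"))
```

Only the shared region is reported; the unrelated flanking residues of both toy sequences are left out of the alignment.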

water

water - Smith-Waterman local alignment.

Input:
    First sequence: embl:xlrhodop
    Second sequence: embl:xl23808
Output: local alignment showing the five aligned regions.

Multiple Sequence Analysis

Multiple sequence alignments are used:

To find patterns to characterize protein families.

To detect or demonstrate homology between a new sequence and existing families of sequences.

To help predict the secondary and tertiary structures of new sequences.

As an essential prelude to molecular evolutionary analysis.

emma

Function
    Multiple alignment program - interface to the ClustalW program

Description
    emma calculates the multiple alignment of nucleic acid or protein sequences according to the method of Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). This is an interface to the ClustalW distribution.

Upload file to emma

Input: the output from seqret (ops2.fasta), which contains all SwissProt sequences matching sw:ops2_*

Click on the browse button to upload the file ops2.fasta

Input sequence to emma: ops2.fasta

emma

emma - interface to the ClustalW program

Output: multiple alignment saved as file ops2.aln.

prettyplot

prettyplot - displays aligned sequences, with colouring and boxing

Input: the output from program emma (ops2.aln)
Output: graphic display of the aligned sequences. Identical residues are in red, similar residues in green.

prophecy

Function
    Creates matrices/profiles from multiple alignments

Description
    This creates a profile matrix file from a nucleic acid or a protein sequence alignment.
    The profile matrix file can then be used by the programs profit or prophet.
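To give a feel for what a profile built from an alignment contains, the Python sketch below computes a plain per-column residue frequency table. This is the simplest possible profile and is an assumption chosen for illustration; it is not the Gribskov profile or the file format that prophecy writes. The toy alignment and the function name `frequency_profile` are hypothetical.

```python
from collections import Counter


def frequency_profile(alignment):
    """Return one {residue: frequency} dict per alignment column, ignoring gaps."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] != "-")
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile


# Toy alignment of four short sequences; "-" marks a gap.
alignment = ["MNGT", "MNGS", "MDGT", "M-GT"]
for position, column in enumerate(frequency_profile(alignment), start=1):
    print(position, column)
```

Fully conserved columns end up with a single residue at frequency 1.0, which is the kind of position-specific information a profile search can exploit.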

prophecy

Input:
    Sequence: the output from program emma (ops2.aln)
    Select type: Gribskov

prophecy

Output: a profile, to be saved as ops2.prophecy.

This profile allows a new sequence to be aligned optimally to a family of similar sequences with the program prophet.

prophet

prophet - gapped alignment for profiles

Input:
    Input sequence: the file xlrhodop.pep, the output from transeq for region 110-1171 of the sequence embl:xlrhodop.
    Profile or matrix file: ops2.prophecy
    Output file: ops2.prophet

Output: the gapped alignment to the profile. The vertical bars (|) represent residues that are identical between the ops2 consensus and our rhodopsin, while the colons (:) represent conservative substitutions. Aligning members of a family can reveal conserved regions that may be important for structure and/or function.

prophet: output

plotorf

plotorf - plots potential open reading frames

Input sequence: embl:xlrhodop
Output: graphical output showing the potential open reading frames in all six frames. The longest protein is in the second frame, which is the correct open reading frame.

getorf

getorf - finds and extracts open reading frames (ORFs)

Input:
    Sequence: embl:xlrhodop
    Type of sequence to output: nucleic sequence between START and STOP codons
Output: textual information on each region and the sequence of that region.
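The search for regions between START and STOP codons can be sketched in Python as follows. This is a simplified illustration, not getorf's algorithm: it scans only the three forward frames, assumes ATG as the only start codon and the standard stop codons, and reports 1-based inclusive coordinates that include the stop codon. The function name `find_orfs` and the toy sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """Return (start, end, frame) for ATG...STOP regions on the forward strand.

    Coordinates are 1-based and inclusive; the end includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start + 1, pos + 3, frame + 1))
                start = None                     # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATAGCC"))
```

A fuller version would also scan the reverse complement to cover all six frames.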

transeq

transeq - translates nucleic acid sequences

Input:
    Sequence: embl:xlrhodop
    Regions to translate: 110-1171 (from the information given by getorf)
Output: translated sequence of the given region. Save the file as xlrhodop.pep
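Translating a region such as 110-1171 means reading it three bases at a time and looking each codon up in a genetic code table. The Python sketch below does this with the standard genetic code; it is an illustration rather than transeq's implementation, and the function name `translate`, the 1-based inclusive region arguments and the use of X for unrecognised codons are assumptions.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq, start=1, end=None):
    """Translate seq[start..end] (1-based, inclusive) in a single reading frame."""
    region = seq.upper().replace("U", "T")[start - 1:end]
    protein = []
    for pos in range(0, len(region) - 2, 3):
        # Codons containing ambiguity codes such as N are reported as X.
        protein.append(CODON_TABLE.get(region[pos:pos + 3], "X"))
    return "".join(protein)


print(translate("CCATGAAATAGCC", 3, 11))
# The region used in the lecture spans a whole number of codons:
print((1171 - 110 + 1) % 3)
```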

Exercise 1 Q1

Align HER2 _ERB2_HUMAN and UNKNOWN_AAL39899.1 with needle and water. What is the main difference between the two types of alignment in these two cases? (The files HER2-fasta.prt and ALL39899_1.prt are at http://bioinfo.hku.hk/tutorial/.)

Repeat the Smith-Waterman alignment of HER2-fasta.prt and ALL39899_1.prt with different parameters. What happens if the gap penalties are changed to 30 and 2 instead of the defaults of 10 and 0.5?

BLOSUM62 is the default matrix. What happens to the local alignment (using the program water) when other matrices are used, e.g. EPAM10?

Exercise 1 Q2

Type gb:A7120FTSZ in the text box and run seqret. Run entret with the same sequence USA and examine the entry. What is the difference between the two entries?

Exercise 1 Q3

With the program infoseq, display information on all sequences whose name starts with ‘10’ in the SwissProt database. (hint: the sequence is sw:10*, choose the information you want to display by changing to ‘yes’)

Exercise 1 answer (A1) Needle output

Exercise 1 answer (A1) Water output

Exercise 1 answer (A1) Water output with a gap opening penalty of 30 and a gap extension penalty of 2.

Exercise 1 answer (A1) Water output with matrix of EPAM10

Exercise 1 answer (A1)

The global alignment (needle) requires the whole sequences to be aligned. Its % identity and % similarity are much lower than those of the local alignment (water).

If the gap penalties are changed to 30 and 2, no gaps appear in the alignment.

If EPAM10 is used, the score and the alignment length drop. Because PAM matrices are derived from global alignments, they give worse results with the local alignment program water. EPAM10 is more suitable for very similar proteins with no more than 10% evolutionary divergence.

Exercise 1 answer (A1)

Amino acid substitution matrices

PAM (percent accepted mutation) lists the likelihood of change from one amino acid to another in homologous sequences during evolution.

One PAM is a unit of evolutionary divergence in which 1% of the amino acids have been changed.

Some amino acid substitutions occur more readily than others, probably because they do not have a great effect on the structure and function of a protein.

Exercise 1 answer (A1)

Amino acid substitution matrices (cont'd)

BLOSUM matrix values are based on a large set of ~2000 conserved amino acid patterns called blocks. Blocks come from a database of protein sequences representing more than 500 families of related proteins.

PAM is derived from global alignments of proteins, while BLOSUM comes from alignments of shorter sequences.

The matrix built from blocks with no more than x% similarity is called BLOSUM X.

Exercise 1 answer (A1)

PAM100 ==> Blosum90
PAM120 ==> Blosum80
PAM160 ==> Blosum62
PAM200 ==> Blosum52
PAM250 ==> Blosum45

The Blosum matrices are best for detecting local alignments.

The Blosum62 matrix is the best for detecting the majority of weak protein similarities.

The Blosum45 matrix is the best for detecting long and weak alignments.

Exercise 1 answer (A1)

If the BLOSUM62 matrix is compared to PAM160, the BLOSUM matrix is found to be less tolerant of substitutions to or from hydrophilic amino acids, while more tolerant of hydrophobic changes and of cysteine and tryptophan mismatches.

Exercise 1 answer (A2) seqret output

Exercise 1 answer (A2) entret output

Exercise 1 answer (A2)

You will see the sequence for the Anabaena 7120 ftsZ and gsh-III genes.

EMBOSS is also capable of extracting more information than just the sequence from a database entry. The program entret will return the entire entry as a text file.

Exercise 1 answer (A3) Output

garnier

garnier - predicts protein secondary structure using the Garnier-Osguthorpe-Robson (GOR) method

Secondary structure prediction is notoriously difficult to do accurately. The GOR I algorithm is one of the first semi-successful methods.

The Garnier method is not regarded as the most accurate prediction, but is simple to calculate on most workstations.

Input: the translated sequence xlrhodop.pep (region 110-1171 of embl:xlrhodop, translated with the program transeq).
Output: predicted protein secondary structure

garnier: output

pepinfo

pepinfo - plots simple amino acid properties in parallel.

Input sequence: the translated sequence xlrhodop.pep (region 110-1171 of embl:xlrhodop, translated with the program transeq).

Output: a textual and graphical representation of amino acid properties (size, polarity, aromaticity, charge, etc.). Hydrophobicity profiles are useful for locating turns, potential antigenic peptides and transmembrane helices.

pepinfo Showing the residues distribution

pepinfo

Hydrophobicity profiles are useful for locating turns, potential antigenic peptides and transmembrane helices.

Positive score -> a hydrophobic region.

Negative score -> a hydrophilic region.

The profile shows seven highly hydrophobic regions. Use the program tmap to investigate further.
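A hydrophobicity profile of this kind can be sketched as a sliding-window average. The code below is a minimal illustration and not pepinfo's own implementation: the function name `hydrophobicity_profile` and the default window of 9 residues are assumptions, and the Kyte-Doolittle scale is used here as one common choice of scale.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_profile(protein: str, window: int = 9) -> list[float]:
    """Return the mean hydropathy of every full window along the protein."""
    # Unknown residue codes contribute 0.0 (an arbitrary choice for this sketch).
    scores = [KYTE_DOOLITTLE.get(aa, 0.0) for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]
```

Runs of windows with positive averages correspond to the hydrophobic stretches in the plot; a transmembrane helix typically shows up as a sustained positive region.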

Page 93: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


patmatmotifs

patmatmotifs - Searches a PROSITE motif database with a protein sequence. It can identify which known protein family (if any) a new sequence belongs to.

PROSITE currently contains patterns and profiles specific for more than a thousand protein families or domains.

PROSITE patterns: Biologically significant amino acid patterns can be summarized in the form of regular expressions.

PROSITE profiles: Techniques based on weight matrices allow the detection of protein families and functional/structural domains with extreme sequence divergence.

Page 94: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


patmatmotifs

Input sequence: The file xlrhodop.pep, which is the output from transeq of the 110-1171 region of the sequence embl:xlrhodop.

Output: A textual representation showing where the sequence matches a motif.

Page 95: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


pscan

pscan - Scans proteins using PRINTS.

PRINTS is a database of diagnostic protein signatures, or fingerprints.

Fingerprints are groups of conserved motifs or elements that together form a diagnostic signature for particular protein families.

An uncharacterised sequence matching all motifs or elements can then be readily diagnosed as a true match to a particular family fingerprint.

Input sequence: The file xlrhodop.pep, which is the output from transeq of the 110-1171 region of the sequence embl:xlrhodop.

Page 96: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


pscan

Output: A textual representation showing where short stretches of the sequence match motifs in the PRINTS database, which defines functional protein families.

Page 97: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


fuzznuc

fuzznuc uses PROSITE-style patterns to search nucleotide sequences.

Letter code for patterns:

[ACG] stands for A or C or G.

{AG} stands for any nucleotide except A and G.

N(3) corresponds to N-N-N; N(2,4) corresponds to N-N or N-N-N or N-N-N-N.

Example: [CG](5)TG{A}N(1,5)C

Input: Sequence: embl:hhtetra Pattern: AAGCTT
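The pattern notation above maps closely onto ordinary regular expressions. The following Python sketch shows one way to do the conversion; it is not how fuzznuc is implemented, the names `prosite_to_regex` and `find_matches` and the `wildcard` parameter are hypothetical, and only the constructs listed on this slide are handled.

```python
import re


def prosite_to_regex(pattern: str, wildcard: str = "N") -> str:
    """Convert a PROSITE-style pattern to a Python regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "-":                      # separators carry no meaning
            i += 1
        elif ch == "[":                    # [ACG] -> [ACG]
            j = pattern.index("]", i)
            out.append(pattern[i:j + 1])
            i = j + 1
        elif ch == "{":                    # {AG} -> [^AG]
            j = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:j] + "]")
            i = j + 1
        elif ch == "(":                    # (3) -> {3}, (2,4) -> {2,4}
            j = pattern.index(")", i)
            out.append("{" + pattern[i + 1:j] + "}")
            i = j + 1
        elif ch.upper() == wildcard.upper():
            out.append(".")                # wildcard matches any symbol
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def find_matches(sequence: str, pattern: str, wildcard: str = "N"):
    """Return (start, end) pairs, 1-based inclusive, for each match."""
    regex = re.compile(prosite_to_regex(pattern.upper(), wildcard))
    return [(m.start() + 1, m.end()) for m in regex.finditer(sequence.upper())]
```

With these assumptions, `prosite_to_regex("[CG](5)TG{A}N(1,5)C")` gives `[CG]{5}TG[^A].{1,5}C`. For protein patterns, where N is asparagine, the wildcard would be passed as `"X"` instead.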

Page 98: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


fuzznuc

Output

Page 99: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


Exercise 2 Q1

Use tmap to display the membrane-spanning regions of the input sequence xlrhodop.pep (translated with the program transeq from the 110-1171 region of embl:xlrhodop). Does the result agree with pepinfo?

Page 100: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


Exercise 2 Q2

Use fuzzpro to search the sequence CREAp_m.txt with the pattern CXXXXC (the file CREAp_m.txt is available from http://bioinfo.hku.hk/tutorial/).

Page 101: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


Exercise 2 Q3

Use patmatmotifs to find patterns in the SwissProt sequences fos_human or fos_rat, and use these patterns with fuzzpro. Search the fos genes of other organisms. (Hint: Use sw:fos_human for the input; other organisms: bovin, chick, mouse, sheep.)

Page 102: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


Exercise 2 Q4

Sometimes it is better to run the program fuzznuc from the command line, because more parameters can be given.

In the BIOINFO terminal, type the following (you must enter the command on one line at the UNIX prompt):

bioinfo% fuzznuc -sequence=embl:hhtetra -pattern=AAGCTT -mismatch=1 -complement -outf=outf.out

How does the result differ from the previous run in the web interface?

Page 103: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


Exercise 2 answer (A1)

Bars are displayed in the plot above the regions predicted as most likely to form transmembrane regions.

There may be seven transmembrane helices in this protein.

The result agrees with pepinfo.

Page 104: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


Exercise 2 answer (A2)

The symbol ‘x’ is used for a position where any amino acid is accepted.

Therefore, the pattern CXXXXC matches the sequences CQFPGC and CMFPGC in the result.

Page 105: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


Exercise 2 answer (A3)

patmatmotifs output using sw:FOS_HUMAN

Page 106: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


Exercise 2 answer (A3)

When run with patmatmotifs, the sequences sw:FOS_HUMAN and sw:FOS_RAT return the same motifs: AMIDATION, LEUCINE_ZIPPER, and BZIP_BASIC.

When fuzzpro is run with one of the patterns, the start and end positions agree with patmatmotifs.

Page 107: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


Exercise 2 answer (A3)

fuzzpro output with pattern “GRAQSIGRRGKVEQ” and sequence sw:fos_human

Page 108: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


Exercise 2 answer (A4)

On the command line you can specify the number of mismatches as an input parameter.

Matches with 1 mismatch are now shown in the result.
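What allowing one mismatch means can be illustrated with a short Python sketch. This is not fuzznuc's algorithm; the function names and the toy sequence in the usage note are hypothetical, and the sketch handles only plain patterns such as AAGCTT (no brackets or repeat counts).

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_with_mismatches(seq: str, pattern: str, max_mismatch: int = 0):
    """Return (start, end, mismatches) for every window within the limit.

    Positions are 1-based and inclusive.
    """
    seq, pattern = seq.upper(), pattern.upper()
    hits = []
    for start in range(len(seq) - len(pattern) + 1):
        window = seq[start:start + len(pattern)]
        # Count positions where the window and the pattern differ.
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= max_mismatch:
            hits.append((start + 1, start + len(pattern), mismatches))
    return hits
```

On the toy sequence `TTAAGCTTGGAAGGTTCC`, an exact search for AAGCTT reports only positions 3-8, while `max_mismatch=1` also reports 11-16 (AAGGTT). To search the complementary strand as well, the same function can be run with `reverse_complement(pattern)`; AAGCTT is its own reverse complement, so here both strands give the same hits.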

Page 109: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


cpgplot

cpgplot - Plots CpG-rich areas.

CpG refers to a C nucleotide immediately followed by a G. The 'p' in 'CpG' refers to the phosphate group linking the two bases.

By default, this program defines a CpG island as a region where, over an average of 10 windows, the calculated % composition is over 50%, the calculated Obs/Exp (i.e. Observed/Expected) ratio is over 0.6, and these conditions hold for a minimum of 200 bases.

These conditions can be modified by setting the values of the appropriate parameters.

Page 110: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


cpgplot

The Observed number of CpG patterns in a window is simply the count of the number of times a 'C' is found followed immediately by a 'G'.

The Expected frequency of CpG's in a window is calculated as the number of 'C's in the window multiplied by the number of 'G's in the window, divided by the window length.

Expected = (number of C's * number of G's) / window length
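The Observed and Expected definitions above can be worked through in a few lines of Python. This is a sketch of the per-window arithmetic only, not of cpgplot's island detection; the function name `cpg_window_stats` and the returned dictionary keys are hypothetical, and "% composition" is taken here to mean the percentage of C plus G in the window.

```python
def cpg_window_stats(window: str) -> dict:
    """Compute CpG statistics for a single window of sequence."""
    window = window.upper()
    length = len(window)
    n_c = window.count("C")
    n_g = window.count("G")
    observed = window.count("CG")          # C immediately followed by G
    expected = (n_c * n_g) / length if length else 0.0
    return {
        "observed": observed,
        "expected": expected,
        # The ratio is undefined when no CpG is expected; report 0.0 then.
        "obs_exp": observed / expected if expected > 0 else 0.0,
        "percent_cg": 100.0 * (n_c + n_g) / length if length else 0.0,
    }
```

For the six-base window `CGCGAT` there are 2 C's and 2 G's, so Expected = (2 * 2) / 6, about 0.67, while Observed = 2, giving an Obs/Exp ratio of about 3. A real analysis would use much longer windows.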

Page 111: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


cpgplot

Input: embl:rnu68037

Output

Page 112: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


cpgplot

Output

Page 113: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


cusp

CUSP reads one or more coding sequences (CDS sequence only) and calculates a codon frequency table.

It is important to use a codon frequency table that is appropriate for the species that your protein comes from.

Input: Seq: embl:paamir Codon usage table: Default (Ehum.cut)

Page 114: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


cusp

Output:

Fract - the fraction of the occurrences of the encoded amino acid that use this codon triplet.

/1000 - the number of codons per 1000 bases.
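The Fract column can be illustrated with a small Python sketch that counts codons in a coding sequence and divides each count by the total for the same amino acid. This is not cusp's code: the function name `codon_fractions` and the toy sequence in the usage note are hypothetical, the standard genetic code is assumed, and the /1000 column is not reproduced.

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_fractions(cds: str) -> dict:
    """Map each codon seen in the CDS to (count, fraction within its amino acid)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Skip codons with ambiguity codes, which are not in the table.
    counts = Counter(c for c in codons if c in CODON_TABLE)
    per_amino_acid = Counter()
    for codon, n in counts.items():
        per_amino_acid[CODON_TABLE[codon]] += n
    return {
        codon: (n, n / per_amino_acid[CODON_TABLE[codon]])
        for codon, n in counts.items()
    }
```

In the toy sequence `ATGGCTGCTGCCTAA`, alanine is encoded twice by GCT and once by GCC, so the fractions are 2/3 for GCT and 1/3 for GCC.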

Page 115: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


cusp

Running the program from the command line allows you to specify the begin and end positions of the sequence.

bioinfo% cusp -sbeg 135 -send 1292

Create a codon usage table

Input sequence(s): embl:paamir

Output file [paamir.cusp]:

Page 116: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


cusp

bioinfo% more paamir.cusp

Page 117: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


hmoment

hmoment plots or writes out the hydrophobic moment.

The hydrophobic moment is the hydrophobicity of a peptide measured for a specified angle of rotation per residue.

Assumption: The angle of rotation (bonds of the backbone and amino acid side-chains) per residue in alpha helices is 100 degrees. The angle of rotation per residue in beta sheets is 160 degrees.

Input: Sequence: sw:hbb_human Produce graph: yes Plot two graphs: yes
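The idea of a hydrophobic moment at a given rotation angle can be sketched in Python as a vector sum: each residue contributes its hydrophobicity along a direction that advances by the chosen angle per residue. This is an illustrative sketch rather than hmoment's implementation; the function name `hydrophobic_moment` is hypothetical, the Kyte-Doolittle scale is an assumed stand-in for whatever scale the program uses, and the sum is left unnormalised.

```python
import math

# Kyte-Doolittle hydropathy values, used here only as an example scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(window: str, angle_deg: float, scale=KYTE_DOOLITTLE) -> float:
    """Length of the summed hydrophobicity vectors for one window.

    angle_deg is the rotation per residue: 100 for an alpha helix,
    160 for a beta sheet, following the assumption on this slide.
    """
    sin_sum = 0.0
    cos_sum = 0.0
    for i, residue in enumerate(window.upper()):
        h = scale.get(residue, 0.0)        # unknown residues contribute nothing
        theta = math.radians(i * angle_deg)
        sin_sum += h * math.sin(theta)
        cos_sum += h * math.cos(theta)
    return math.hypot(sin_sum, cos_sum)
```

A window whose hydrophobic residues all fall on one face of the helix gives a large moment at 100 degrees, whereas a uniformly hydrophobic window gives a value near zero because the contributions cancel around the helix. Evaluating the same windows at 100 and 160 degrees gives the two curves, one for the alpha helix moment and one for the beta sheet moment.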

Page 118: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


hmoment

Output: Two graphs, one for the alpha helix moment and one for the beta sheet moment.

Page 119: HKUHKU Computer Centre Introduction to EMBOSS Christine Ho chrisho@cc.hku.hk


End of lecture

Thank you!