Layout by orngjce223, CC-BY Custom BLAST Databases A Primer Shawn Houston [email protected] UAF Life Science Informatics

Upload: norman-skinner

Post on 17-Dec-2015



Custom BLAST Databases: A Primer

Shawn Houston
[email protected]
UAF Life Science Informatics


Custom BLAST Databases

Why?
  To limit your search domain
  To use your unique sequences
  To automate your BLAST searches
    Pipeline
    Workflow

How?
  Linux
    It's what I do...


Custom BLAST Databases

What do I need?
  Input in either FASTA or ASN.1 format
    I will focus on FASTA
  NCBI Toolkit
    formatdb
    BLAST binary downloads include formatdb

formatdb [-] [-B filename] [-F filename] [-L filename] [-T filename] [-V] [-a] [-b] [-e] [-i filename] [-l filename] [-n str] [-o] [-p F] [-s] [-t str] [-v N]

DESCRIPTION
formatdb must be used in order to format protein or nucleotide source databases before these databases can be searched by blastall, blastpgp or MegaBLAST. The source database may be in either FASTA or ASN.1 format. Although the FASTA format is most often used as input to formatdb, the use of ASN.1 is advantageous for those who are using ASN.1 as the common source for other formats such as the GenBank report. Once a source database file has been formatted by formatdb it is not needed by BLAST. Please note that if you are going to apply periodic updates to your BLAST databases using fmerge(1), you will need to keep the source database file.


FASTA Format

>This is an entry header
atcgtcgattgatgtcgtgatcgtagtcgtagctgatgactgtatgctgcatgtgctaaaaacatgctagct

Important Note
NCBI only considers the first 32 characters in a FASTA header significant, and NCBI-provided tools will decide if a sequence is unique using only these.
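Because only the first 32 header characters count, it may be worth checking a FASTA file for headers that collide on that prefix before formatting it. The following is a minimal sketch, not an NCBI tool: the function name dup_header_prefixes is made up for illustration, and it assumes the 32 characters are counted from the first character after '>'.

```shell
# Print each 32-character header prefix that occurs more than once.
# Assumption: the prefix is counted from the first character after ">".
dup_header_prefixes() {
    awk '/^>/ {
             prefix = substr($0, 2, 32)      # drop ">" and keep 32 characters
             if (++seen[prefix] == 2) print prefix
         }' "$1"
}
```

Run as `dup_header_prefixes myfile.fa`; no output means no two headers share their first 32 characters.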


The FASTA Header

>dbi|accnum| my header

dbi is an NCBI-recognized database ID:

Database Name                    Identifier Syntax
GenBank                          gb|accession|locus
EMBL Data Library                emb|accession|locus
DDBJ, DNA Database of Japan      dbj|accession|locus
NBRF PIR                         pir||entry
Protein Research Foundation      prf||name
SWISS-PROT                       sp|accession|entry name
Brookhaven Protein Data Bank     pdb|entry|chain
Patents                          pat|country|number
GenInfo Backbone Id              bbs|number
General database identifier      gnl|database|identifier
NCBI Reference Sequence          ref|accession|locus
Local Sequence identifier        lcl|identifier
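The pipe-delimited layout makes headers easy to take apart with standard tools. As a rough sketch, the hypothetical helper below lists the database ID and accession for each header. It assumes the accession is the second field, or the third field for gnl, so layouts such as pir||entry, prf||name and pat|country|number are not handled correctly; free-form headers without a '|' are skipped.

```shell
# Print "database-id<TAB>accession" for each structured header in a FASTA file.
# Assumption: accession is field 2, except for gnl|database|identifier (field 3).
header_ids() {
    awk -F'|' '/^>/ && NF >= 2 {
        dbi = substr($1, 2)                 # strip the leading ">"
        acc = (dbi == "gnl") ? $3 : $2
        printf "%s\t%s\n", dbi, acc
    }' "$1"
}
```

Run as `header_ids myfile.fa` to see which identifiers a file would contribute.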


The FASTA Header 2

Do not leave any space between '>' and the NCBI Database ID

gnl and lcl can be your friend

fastacmd
  Retrieves sequences from a BLAST-formatted database in FASTA format by accession number

Free-form headers are allowed
  Do not forget the 32 character “limit”
  Some things will not work (fastacmd, etc.)


The FASTA Header 3

>gnl|mydb|seq0001| sequence 1
atcgtagctagtcgatgctgtagc
  Uses seq0001 as accession number
  Indexes it under the database name mydb

>lcl|seq0001| sequence 1
atcgtagctagtcgatgctgtagc
  Uses seq0001 as accession number
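If your sequences currently have free-form headers, one way to get lcl identifiers like the ones above is to number the entries as they appear. This is only a sketch: the function name add_lcl_ids, the seq prefix and the four-digit width are illustrative choices, not anything NCBI requires.

```shell
# Rewrite each header as ">lcl|seqNNNN| original text", numbering from 1.
# The "seq" prefix and four-digit width are illustrative choices.
add_lcl_ids() {
    awk '/^>/ { printf ">lcl|seq%04d| %s\n", ++n, substr($0, 2); next }
         { print }' "$1"
}
```

Run as `add_lcl_ids myfile.fa > myfile.lcl.fa`, writing to a new file so the original stays untouched.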


But... I use Windows!

DOS file line endings: CR/LF
Apple: CR or LF
Linux (Unix): LF

To convert:
  dos2unix
  tr -d '\r' < dosfile > unixfile
  perl -pi -e 's/\r\n/\n/g' yourfile
  etc.
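Before converting, it can help to find out whether a file contains carriage returns at all. A small sketch built on the tr approach; the function names has_cr and to_unix are made up for illustration.

```shell
# Succeed if the file contains at least one carriage return (CR).
has_cr() {
    grep -q "$(printf '\r')" "$1"
}

# Strip every CR. $1 = input file, $2 = output file (must differ from $1).
to_unix() {
    tr -d '\r' < "$1" > "$2"
}
```

For example, `has_cr myfile.fa && to_unix myfile.fa myfile.unix.fa`. Note that deleting CRs suits CR/LF files only; for a CR-only file it would join every line, so there `tr '\r' '\n'` is the safer choice.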


Formatting Your Database

Let us assume we have a text file containing FASTA-format nucleotide sequences, myfile.fa

Let us assume we have a command line: Cygwin, Apple Terminal, Linux, HP-UX, …

$ formatdb -pF -imyfile.fa

What do I get?
  myfile.fa.nhr, myfile.fa.nin, myfile.fa.nsq
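In a script it is handy to confirm that all three files were written before moving on. A minimal sketch, assuming a single-volume nucleotide database; check_nt_db is a hypothetical helper name.

```shell
# Succeed only if the .nhr, .nin and .nsq files exist for a database.
# $1 is the name that was passed to formatdb -i (for example myfile.fa).
check_nt_db() {
    for ext in nhr nin nsq; do
        [ -f "$1.$ext" ] || { echo "missing: $1.$ext" >&2; return 1; }
    done
}
```

For example, `formatdb -pF -imyfile.fa && check_nt_db myfile.fa`.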


Formatting Your Database 2

But I am not using accession numbers or database identifiers...

$ formatdb -pF -oF -imyfile.fa

This produces the same files that work in the same way, except...
  No internal accession index
  No internal database identifier


Using Your New Database

Copy or move myfile.fa.nhr, myfile.fa.nin, myfile.fa.nsq to their final resting place

Let's use it!
  We need an input sequence or sequences, FASTA format, in one file, myseq.fa

$ blastall -pblastn -imyseq.fa -d/mypath/myfile.fa -omyblast.out
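For the pipeline case, the same command can be wrapped in a loop over several query files. The sketch below only prints the commands unless RUN=1 is set, so it can be tried without BLAST installed. The function name blast_each, the RUN switch and the output naming are all illustrative assumptions; the blastall options are the ones used above.

```shell
# Print (or, with RUN=1, execute) one blastall command per query file.
# $1 = database path as given to -d; remaining arguments = query FASTA files.
blast_each() {
    db=$1; shift
    for query in "$@"; do
        out="${query%.fa}.blast.out"       # output naming is an illustrative choice
        if [ "${RUN:-0}" = 1 ]; then
            blastall -p blastn -i "$query" -d "$db" -o "$out"
        else
            echo blastall -p blastn -i "$query" -d "$db" -o "$out"
        fi
    done
}
```

For example, `blast_each /mypath/myfile.fa *.fa` lists what would run, and `RUN=1 blast_each /mypath/myfile.fa *.fa` runs it.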


Let's Get Some Data

You might have some data already, or

NCBI
  http://www.ncbi.nlm.nih.gov/
Biomirror
  http://www.bio-mirror.net/
EMBL
  http://www.ebi.ac.uk/embl/
DDBJ
  http://www.ddbj.nig.ac.jp/


Let's Get Some Data 2

http://xml.nig.ac.jp/tutorial/rest/index.html#l2.1

use LWP::UserAgent;

$ua = new LWP::UserAgent;

# make request
$req = new HTTP::Request POST => 'http://xml.ddbj.nig.ac.jp/rest/Invoke';
$req->content_type('application/x-www-form-urlencoded');
# set parameters
$req->content('service=GetEntry&method=getDDBJEntry&accession=AB000100');

# send request and get response.
$res = $ua->request($req);
# If you want to get a large result, it is better to write to a file directly.
# $res = $ua->request($req,'file_name.txt');

# show response.
print $res->content;


Let's Get Some Data 3

ftp://ftp.ncbi.nih.gov/genbank/genomes/Fungi/
  Aspergillus_fumigatus
  Aspergillus_nidulans_FGSC_A4
  Candida_albicans
  Candida_dubliniensis_CD36
  Candida_glabrata_CBS138
  Cryptococcus_neoformans_var_JEC21
  Debaryomyces_hansenii_CBS767
  ...


Where To Go From Here

$ man formatdb
$ man blastall
$ blastall -
HTML Documentation

But, I don't have NCBI Tools installed!
  Get your computer support people to do this if you can; otherwise you can download binaries from
  ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.23/


Still Going...

There are no instructions for installing NCBI binaries
  On Linux the BLAST data files go in /usr/share/ncbi/data

There are a lot of BLAST programs
  blastall
  Blast
  megablast
  C++ Version (blastn, blastp, etc.)


Are We Done?

Questions
Comments
Demo
  ftp://folders.inbre.alaska.edu/FMP/BLASTdbDemo/
  ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.23/

Conclusion(s)
  This is easy! (keep repeating until you believe)
