gbug feb09 cramer
Building Your Own Gene Machine With Unix/Linux
Robert A. Cramer Jr., Ph.D.
Department of Veterinary Molecular Biology
Montana State University
Seminar Purpose
• YOU ….. CAN …….. DO ……. IT!!!
Shhhhhhh ….. AND YOU SHOULD!
Oh My %*&#…… NOT THE COMMAND LINE!
Why?
• If you work in biology and use molecular/genomics tools ……. You Have To! (One way or another)
• Independence ….. Do it yourself!
• Convenience … anytime, anywhere
• GUIs on the internet have their limitations
– But you probably already know that
• Fun?
My Story … I’m not a bioinformatician .. But
• 5,000 ESTs from a mixed-infection library ….. What to do?
• I wanted to graduate before 2020, so analyzing one sequence at a time was not going to cut it…….. !
• No “cluster” informatic resources available to me…… more or less on my own ….
• Hello Command Line … Hello UNIX ….. Hello MAC
Building Your Gene Machine
• Step 1: Become Familiar with Unix Commands (Or Linux if you prefer PCs)
– Intimidating part for most ….. But it is painless …. Really ……. Okay, maybe just a bit …. :-)
• Step 2: Install Basic Informatics Software
– Most Scientists Try and Start Here then Proceed to 1 :-)
• Step 3: Trial and Error … Yes, can I have some more CT Drill Sergeant? Well, yes, you must!
Unix (On the almighty Mac)
• OSX is a flavor of Unix - So is Linux
– Windows is not (its command line comes from DOS) ….. Ugh.
– MAC gives you best of both worlds.
• Terminal - direct link to the computer - you are the boss! Under /Applications/Utilities on MAC
• X11 on Macs - can install from the Developer Tools Disc that comes with all Macs (Encourage you to install, not all open source software comes in binary form! Includes latest gcc compiler). In Applications. Allows you to run graphical X programs (like PHYLIP or CLUSTALX).
• Linux on the PC - Many flavors, RedHat Fedora is Free --- I learned on PC running RedHat Linux
Unix Basics
• The SHELL - command interpreter
– BASH most popular, followed by csh or tcsh; I use tcsh, why? I learned it first.
• Hierarchical system
– Directories (like folders on Windows or Mac)
– Sub-Directories
– Files
– KNOW WHERE YOU ARE!!!! Key Unix Concept
• Unix Commands all lowercase - Unix is case sensitive
• Unix Command: pwd - show current working directory
• Unix Command: cd - change directory
• When you start-up terminal you are in your HOME directory
• Unix Command: ls - lists what’s in the current directory
• Unix Commands - Easy to find, just use “the google”
– http://www.cs.drexel.edu/~kschmidt/Ref/unix_reference.html
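A quick way to practice pwd, cd and ls is sketched below. It is only an illustration: it works in a throwaway directory created with mktemp, and the names "sequences" and "notes.txt" are made up, so nothing in your HOME directory is touched.

```shell
# Make a scratch directory to practice in (mktemp picks the name for us)
practice_dir=$(mktemp -d)

# cd - change directory: move into the scratch directory
cd "$practice_dir"

# pwd - show where you are now
pwd

# Make a sub-directory and an empty file (example names), then list them
mkdir sequences
touch notes.txt
ls

# Move down into the sub-directory, check where you are, come back up one level
cd sequences
pwd
cd ..
here=$(pwd)
```

Running pwd after every cd is a cheap habit that answers the "KNOW WHERE YOU ARE" question before anything goes wrong.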
THE COMMAND
• man “command”
– Will bring up the manual for any Unix command, telling you how to use it and what it is used for
– Wow, how user friendly!
The Biggest Mistake ….
• Most common mistake beginning Unix users make is not understanding the concept of working directories and PATH
• To execute a program you MUST be in the directory where the program is installed (and run it as ./program), or type its full path
– Computers are STUPID!!!! You MUST tell them everything (with no syntax errors).
• UNLESS …. You set your PATH
– Set in a login file that tells stupid computer where to look when you run commands
– .tcshrc, .cshrc, etc. etc.
– Editors ….. Can edit your login file or any file for that matter, I use vi or pico
– Editors have their own sets of commands … again GOOGLE! Or buy a book!
Path
From: http://www.dartmouth.edu/~rc/classes/unix1/print_pages.shtml
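To see what PATH actually does, here is a small sketch. Assumptions: it is written in sh/bash syntax (in tcsh the same idea is the "set path = ( ... )" line in your .cshrc), it works in a throwaway directory, and "mytool_demo" is a made-up stand-in for a real program.

```shell
# Work in a throwaway directory so nothing on the system is changed
demo_dir=$(mktemp -d)
mkdir "$demo_dir/bin"

# A stand-in "program" (hypothetical name) that just prints a message
printf '#!/bin/sh\necho "hello from mytool_demo"\n' > "$demo_dir/bin/mytool_demo"
chmod +x "$demo_dir/bin/mytool_demo"

# Its directory is not on the PATH yet, so the shell cannot find it by name
if command -v mytool_demo >/dev/null 2>&1; then
    found_before=yes
else
    found_before=no
fi

# Typing the full path always works, PATH or no PATH
"$demo_dir/bin/mytool_demo"

# Put the directory at the front of PATH: now the name alone is enough,
# no matter which directory you are working in
PATH="$demo_dir/bin:$PATH"
export PATH
found_after=$(command -v mytool_demo)
mytool_demo
```

A PATH change typed at the prompt lasts only for that Terminal session, which is why the permanent version goes in a login file.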
Second Biggest Mistake …
• Directory and File Permissions!
– Unix is very secure, but you have to be aware of your permissions when installing software and writing files to directories
• ROOT user always has permission
– So many software installs are done as ROOT
– If you try to install a program, or make a new directory, and an error comes back telling you that you do not have permission, you know why!
Permissions
Modified From: Kschmidt, Drexel
chmod command can modify permissions
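A minimal chmod sketch, run on a throwaway file made with mktemp so no real file is affected. The particular permission settings are just examples.

```shell
# A scratch file to experiment on
demo_file=$(mktemp)

chmod 644 "$demo_file"    # owner: read+write, group: read, others: read
ls -l "$demo_file"

chmod u+x "$demo_file"    # add execute permission for the owner (u = user)
chmod go-r "$demo_file"   # remove read permission from group (g) and others (o)
ls -l "$demo_file"

# The first 10 characters of an "ls -l" line are the file type and permission bits
perms=$(ls -l "$demo_file" | cut -c1-10)
echo "$perms"
```

Read the ten characters as one type character followed by three rwx triplets: owner, group, others.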
Third ….
• File formats …… really, a lot of bioinformatics is manipulating sequence files into correct formats.
• Common Complaint: Student to Instructor: “I keep trying to run my protein sequence in a local blast but it does not work. I don’t know why, I got my sequence from NCBI and cut and paste it into Microsoft Word, saved it and now BLAST does not work”
Files …..
>YDR044W Chr 4
MPAPQDPRNLPIRQQMEALIRRKQAEITQGLESIDTVKFHADTWTRGNDGGGGTSMVIQD
GTTFEKGGVNVSVVYGQLSPAAVSAMKADHKNLRLPEDPKTGLPVTDGVKFFACGLSMVI
HPVNPHAPTTHLNYRYFETWNQDGTPQTWWFGGGADLTPSYLYEEDGQLFHQLHKDALDK
HDTALYPRFKKWCDEYFYITHRKETRGIGGIFFDDYDERDPQEILKMVEDCFDAFLPSYL
TIVKRRKDMPYTKEEQQWQAIRRGRYVEFNLIYDRGTQFGLRTPGSRVESILMSLPEHAS
WLYNHHPAPGSREAKLLEVTTKPREWVK*
Text File From A Text Editor: This is GOOD
??^Q�^Z?^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@>^@^C^@??^@^F^@^@^@^@^@^@^@^@^@^@^@^A^@^@^@%^@^@^@^@^@^@^@^@^P^@^@'^@^@^@^A^@^@^@????^@^@^@^^@$^@^@^@??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????^@~Gb^D^@^@?^R?^@^@^@^@^@^A^Q^@^A^@^A^@^F^@^@b^G^@^@^N^@jbjb^B?^B?^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@..^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@>YDR044WChr 4^MMPAPQDPRNLPIRQQMEALIRRKQAEITQGLESIDTVKFHADTWTRGNDGGGGTSMVIQD^MGTTFEKGGVNVSVVYGQLSPAAVSAMKADHKNLRLPEDPKTGLPVTDGVKFFACGLSMVI^MHPVNPHAPTTHLNYRYFETWNQDGTPQTWWFGGGADLTPSYLYEEDGQLFHQLHKDALDK^MHDTALYPRFKKWCDEYFYITHRKETRGIGGIFFDDYDERDPQEILKMVEDCFDAFLPSYL^MTIVKRRKDMPYTKEEQQWQAIRRGRYVEFNLIYDRGTQFGLRTPGSRVESILMSLPEHAS^MWLYNHHPAPGSREAKLLEVTTKPREWVK*^M^M^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
Same file from Word: Gee, wonder why this does not work?
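A rough sanity check for a sequence file before handing it to BLAST might look like the sketch below. Assumptions: the function name check_fasta is made up, the two test files are tiny examples built from the first residues of the sequence above, and the check only looks at two things (does the file start with ">" and does it contain carriage returns, the ^M characters). It will flag a Word document as suspect, but it cannot repair one; that has to be re-saved as plain text.

```shell
work=$(mktemp -d)

# GOOD: plain text FASTA with Unix line endings
printf '>YDR044W\nMPAPQDPRNL\nPIRQQMEALI\n' > "$work/good.fasta"
# BAD: same record, but lines end in carriage returns (shown as ^M) instead of newlines
printf '>YDR044W\rMPAPQDPRNL\rPIRQQMEALI\r' > "$work/bad.fasta"

check_fasta() {
    # First byte must be ">" and the file must not contain any carriage returns
    first=$(head -c 1 "$1")
    cr_count=$(tr -cd '\r' < "$1" | wc -c | tr -d ' ')
    if [ "$first" = ">" ] && [ "$cr_count" -eq 0 ]; then
        echo "ok"
    else
        echo "suspect"
    fi
}

check_fasta "$work/good.fasta"
check_fasta "$work/bad.fasta"

# Repair CR-only line endings by turning each carriage return into a newline.
# (For Windows-style CR+LF endings, delete the CR instead: tr -d '\r')
tr '\r' '\n' < "$work/bad.fasta" > "$work/fixed.fasta"
check_fasta "$work/fixed.fasta"
```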
Okay Already …. Your Gene Machine
• This is just an intro! The software you can install on your own personal gene machine is virtually limitless these days … install what you need.
• These are some of the basic essentials that I use routinely to analyze genomic sequence data
– BLAST - NCBI or Wash U
– EMBOSS - Must have
– HMMER - Hidden Markov Models for Gene Finding
– Prosite - Patterns and Profiles from proteins
– FINK - Incredible resource for MAC users (another reason to use a MAC if you do a lot of informatics)
Installing Local BLAST
• NCBI FTP site - on NCBI home page
• Download appropriate version for your flavor of Unix!
• Know where you install it
– Completely up to you!
• Some people install all programs (executables) in the directory /usr/local/bin
• Some people install programs in their own respective directories, e.g. /Users/rcramer/BLAST
• Regardless, you should make sure your installation directory is in YOUR PATH
Now the Installation
• Unpack the file in your favorite directory!
– *you may need to do this as root user (or with sudo) if you get an error saying you do not have permission
– rcramer% mkdir /usr/local/bin (or, as root: sudo mkdir /usr/local/bin)
– rcramer% mv /Users/rcramer/Desktop/blastetc.tar.gz /usr/local/bin
– rcramer% cd /usr/local/bin
– rcramer% gunzip -c blastetc.tar.gz | tar xf -
– Follow the UNIX install and testing of the installation instructions in the README.bls file
– You’ll know it’s working if you type:
– rcramer% blastall
• And get a list of various options
– Don’t forget to set your path in your .cshrc file (use the directory that actually holds the BLAST executables)!
vi .cshrc
set path = ( /Users/rcramer/blast/blastetc/bin ${path} )
Step 2- BLAST Databases
• The power of local BLAST is you can install multiple genome databases or any type of sequence database that you use routinely!
– Databases can be obtained at NCBI or your favorite organism’s genome homepage
– Usually in FASTA format
• Use the formatdb command to format your database
– Make sure you format it correctly, protein or nucleotide! (-p T for protein, -p F for nucleotide)
– formatdb -i afu_peptides.seq -p T -o T
Advantages of Local Blast
• Can make your own BLAST databases
• Can run “batch blast”, i.e. many sequences at the same time, and not compete with others on the internet server
• Can do BLAST searches wherever, whenever, regardless of whether you have internet access
• Control --- can control the output, many many options!!! (Important for downstream analyses)
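A batch search with the legacy blastall program could be sketched as follows. Assumptions: afu_peptides.seq is the database file from the formatdb example, my_queries.fasta and the output name are made-up placeholders, and the E-value cutoff of 1e-5 is an arbitrary choice for illustration. The commands run only if blastall and both files are actually present.

```shell
db=afu_peptides.seq            # protein database in FASTA format
query=my_queries.fasta         # hypothetical file holding MANY protein sequences
out=my_queries_vs_afu.tab      # hypothetical output file name

if command -v blastall >/dev/null 2>&1 && [ -f "$db" ] && [ -f "$query" ]; then
    # Format the database once (-p T = protein, -o T = parse and index sequence IDs)
    formatdb -i "$db" -p T -o T
    # One command searches every sequence in the query file:
    #   -p program, -d database, -i input, -o output,
    #   -e E-value cutoff, -m 8 = tab-delimited output (easy to parse later)
    blastall -p blastp -d "$db" -i "$query" -o "$out" -e 1e-5 -m 8
    status=ran
else
    echo "blastall, $db or $query not found; nothing was run" >&2
    status=skipped
fi
```

The tabular output (-m 8) is the "control" point from the list above: one hit per line, ready for sorting and counting with ordinary Unix commands.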
EMBOSS
http://emboss.sourceforge.net/
• Comprehensive sequence analysis tool-kit
• Contains Hundreds of sequence analysis programs
• All free!!
• Can be run from command line, allows you to “Script” together several programs at a time (real analysis power when you start doing this)
• Several GUIs are also available to download and install
• Step 1: Acquire Latest Release
• Step 2: Install According to Instructions
– Remember your permissions (root), PATH
– http://emboss.sourceforge.net/docs/adminguide/node8.html
• Step 3: Test Run!
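As an idea of what “scripting together” EMBOSS programs can look like, here is a hedged sketch that chains two of them: getorf to pull ORFs out of nucleotide sequences, then pepstats to report protein statistics (including isoelectric point) for each ORF. Assumptions: my_contigs.fasta and the output names are made-up placeholders, and the 300-nucleotide minimum ORF size is an arbitrary choice. The commands run only if EMBOSS and the input file are present.

```shell
contigs=my_contigs.fasta       # hypothetical nucleotide FASTA file
orfs=my_contigs_orfs.pep       # hypothetical output: translated ORFs
stats=my_contigs_orfs.pepstats # hypothetical output: per-protein statistics

if command -v getorf >/dev/null 2>&1 && command -v pepstats >/dev/null 2>&1 \
   && [ -f "$contigs" ]; then
    # Step 1: find ORFs (START to STOP, at least 300 nt) and write them as protein
    # -auto turns off interactive prompting so the command can run inside a script
    getorf -sequence "$contigs" -outseq "$orfs" -minsize 300 -find 1 -auto
    # Step 2: feed the output of step 1 straight into the next program
    pepstats -sequence "$orfs" -outfile "$stats" -auto
    status=ran
else
    echo "EMBOSS or $contigs not found; nothing was run" >&2
    status=skipped
fi
```

The output file of one program is simply the input of the next, which is all a pipeline really is.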
Example EMBOSS Install
• Download EMBOSS-3.x.x.tar.gz
• Create directory you want to install emboss in: *Do this as ROOT
– rcramer # mkdir /Users/rcramer/emboss
– rcramer # mv EMBOSS-3.x.x.tar.gz /Users/rcramer/emboss
– rcramer # cd /Users/rcramer/emboss
– rcramer # gunzip EMBOSS-3.x.x.tar.gz
– rcramer # tar -xf EMBOSS-3.x.x.tar
– This last step makes a NEW DIRECTORY EMBOSS-3.x.x
– rcramer # cd /Users/rcramer/emboss/EMBOSS-3.x.x
– rcramer # ./configure
• ** You need a gcc compiler installed!!!
– rcramer # make
– rcramer # make install
• Make sure you SET your PATH in your .cshrc file!
– e.g. set path = ( /Users/rcramer/emboss/EMBOSS-5.0.0/emboss/ ${path} )
• Some EMBOSS applications use GUIs, you need to set the PLPLOT environment variable AND have an X windows interface (MAC USERS = X11)
– In your .cshrc file: setenv PLPLOT_LIB /Users/rcramer/emboss/EMBOSS-5.0.0/plplot/lib
Wossname is your EMBOSS friend
• Try running wossname
– rcramer % wossname restrict
SEARCH FOR 'RESTRICT'
recoder Remove restriction sites but maintain same translation
redata Search REBASE for enzyme name, references, suppliers etc
remap Display sequence with restriction sites, translation etc
restover Find restriction enzymes producing specific overhang
restrict Finds restriction enzyme cleavage sites
showseq Display a sequence with features, translation etc
silent Silent mutation restriction enzyme scan
• Can you find a program to:
• Display multiple alignments - Yes
• Find ORFs (Open Reading Frames) - Yes
• Translate a sequence - Yes
• Find restriction enzyme sites - Yes
• Find the isoelectric point of a protein - Yes
• Do global alignments - Yes
• Write your dissertation - No
EMBASSY
• A group of programs similar to EMBOSS but kept separately, so you need to install them separately:
– HMMER, MEME, TOPO, PHYLIP, and more!
• Detailed installation instructions for both EMBOSS and EMBASSY:
http://emboss.sourceforge.net/docs/adminguide/admin.html
Your Gene Machine
• If you install BLAST with your favorite databases ..
• EMBOSS Package
• EMBASSY Package
You’ve created a very powerful and useful personal gene machine that you can use anywhere, anytime!
Of course there is much more available. ClustalW, Prosite, MUSCLE, PHRED, PHRAP, etc. etc.
What you put on your Gene Machine is up to you
Last - Maybe Most Important?
http://www.finkproject.org/
An absolute must to have installed if you are a MAC USER (and you should be if you do a lot of informatics!)
Fink Packages
Remember …..
• You have to engage the command line
– You will fail, but the computer will always tell you what is wrong. So try again! (Don’t forget about “the google”)
• PERMISSIONS
• PATH
• ENVIRONMENT
• FILE FORMAT
• Most of the time you will fail because one of the above 4 is not right
Some Resources
• Each program will have a manual; often just running the program w/o any arguments will bring up all the possible options and tell you the correct syntax
• Introduction to Unix:
– Just google this, LOTS of webpages with basic Unix commands, lectures etc.
– MSU Bioinformatics Core Facility - Intro to Unix Class, Computational Cluster, etc.
• Books - lots of good intro to Unix books out there, O’Reilly Series.
Let’s take the Gene Machine for a test drive
• Non-ribosomal Peptide Synthetase Gene
• Newly Sequenced Genome
• How many NRPS does it have?
– Simple Right? Yes, but ….
– Multiple domains make BLAST search inconclusive
– But BLAST will narrow the field
– HMMER or PROSITE can give definitive number by examining domains
• All done in a matter of minutes while you watch “The Office”
Do I have to do this with one sequence at a time? NO!!
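The “BLAST will narrow the field” step can be sketched with a few standard Unix commands working on tab-delimited BLAST output (blastall -m 8). Assumptions: the five hit lines below are made-up numbers for illustration only, not real results, and the 1e-10 E-value cutoff is arbitrary. The sketch covers only the BLAST narrowing step; the HMMER or PROSITE domain check that gives the definitive count is not shown.

```shell
hits=$(mktemp)

# Made-up example rows in -m 8 column order:
# query subject %identity aln_length mismatches gaps q.start q.end s.start s.end e-value bitscore
# (typed with spaces here and converted to tabs by tr)
tr ' ' '\t' > "$hits" <<'EOF'
nrps_query prot_001 45.2 900 480 13 1 900 10 905 1e-120 430
nrps_query prot_001 38.0 400 240 8 950 1350 1000 1395 3e-40 160
nrps_query prot_007 36.5 420 260 7 20 440 15 430 2e-35 145
nrps_query prot_019 28.1 120 85 2 100 220 300 418 0.002 38.5
nrps_query prot_023 33.0 300 195 6 500 800 40 338 5e-15 80.1
EOF

# Keep hits at or below the E-value cutoff (column 11), print the subject ID (column 2),
# then collapse repeat hits to the same protein, since a multi-domain protein
# can be hit several times
awk -F '\t' -v cutoff=1e-10 '($11 + 0) <= (cutoff + 0) { print $2 }' "$hits" \
    | sort -u > "$hits.candidates"

candidates=$(wc -l < "$hits.candidates" | tr -d ' ')
echo "$candidates candidate proteins to check with HMMER or PROSITE"
```

Counting unique subject IDs rather than hit lines matters here, because one NRPS with several similar domains shows up as several lines.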