1
Genome Browsers and Data Display
Andy Conley, 3/26/2012
2
James Kent. Know that name.
He is one of the greatest, perhaps the greatest, bioinformatics programmers ever.
He was deeply involved in the assembly of the public human genome project.
If you were in the fall class, you compiled the James Kent source tree. Almost all of it is his.
He speaks nothing but the truth.
Who is this crazy looking guy?
3
“Genome browsers facilitate genomic analysis by presenting alignment, experimental and annotation data in the context of genomic DNA sequences.”
Melissa S Cline & James W Kent, 2009
Genome browsers aggregate data
He knows what a genome browser should be
4
The UCSC Genome Browser
Clicking on any of these takes you to a page full of details: CDKN2A
5
They are any kind of genomic information
Genes
Transposable element insertions
Transcription factor binding sites
Sites prone to recombination
Conservation of genomic sequences
Extremely important in modern times are tracks displaying ChIP-seq or RNA-seq data
Tracks don’t have to be genes
6
Arguably the most advanced genome browser, it is much more than a tool for looking at genomes
It integrates a huge amount of data for each gene it displays.
The UCSC also has a graphical front end for downloading from its huge backend database
What’s good about the UCSC GB?
7
It hosts the ENCODE project, one of the largest, probably the largest, assemblies of functional genomic data.
It lets you jump between orthologous regions in different genomes: CDKN2A
It has a massive database backend of over 6500 tables.
This UCSC browser does so much more
8
It’s really, really, really hard to install.
It’s impossible to understand unless you’ve tried to do it.
The UCSC genome browser works so well for the genomes that it has because it is so very, very specialized for those genomes.
Each track in the UCSC browser has been lovingly crafted.
So why aren’t there dozens of UCSC implementations?
9
A ridiculous number of genomes
They’re going to be coming out even faster in the next year or two, then faster after that.
Things like the new PacBio sequencers, which provide longer reads, should make assembling eukaryotic genomes easier.
There are many genomes out now
10
You can’t load them/annotate them by hand – it all has to be automated.
The UCSC guys do it for the human genome because it’s the human genome.
They’re all different from each other.
You have to have some easily deployable storage/display method for your data.
How do you handle so many genomes?
11
There are a number of choices out there for a genome browser
There are really just 2 big ones:
UCSC
GMOD & GBrowse
We already discussed why you don’t use the UCSC browser for projects
Browser choices
12
Generic – It can handle any organism
Model Organism – Not really; it works for whatever genome you have
Database – Not really a database, but there is a database in it.
GMOD just sounds good
gmod.org
Generic Model Organism Database
13
A simple, easily deployable method for storing, viewing and editing genomic data.
GMOD has many, many parts
Some of the big ones:
Apollo – Eww
Chado – A mechanism for storing genomic data
GBrowse – A genome browser
So what is GMOD then?
14
Probably (definitely) the most commonly used of the GMOD components
It is a simple but extensible platform for displaying genomic data
It is maintained mostly by this man: Scott Cain
GBrowse
15
Many projects use GBrowse as their genome viewer
GBrowse installations
16
WormBase is to the C. elegans genome what the UCSC browser is to the human and mouse genomes. It is huge.
WormBase
17
FlyBase hosts many Drosophila genomes, though not with the depth of WormBase
WormBase is really at the top of non-UCSC browsers in its depth of information
This makes sense, given that nematodes are so heavily studied and very easy to work with.
FlyBase
18
The result of the first couple years of the class
Currently maintained by Lee Katz at the CDC
NBase
19
More from NBase
20
You can use colors for information
Darker genes were flagged as horizontally transferred by more programs
This shows genes that we thought were horizontally transferred
21
We had a track of virulence factors in the first year
Clicking on any of them took you to details for the gene, a link to VFDB, etc.
You can also have specialized tracks
22
You can alter how tracks are shown in other ways
Add and remove tracks, change the link that appears over a feature in the genome.
This goes beyond colors
23
One big, important thing:
“Genome browsers facilitate genomic analysis by presenting alignment, experimental and annotation data in the context of genomic DNA sequences.”
Melissa S Cline & James W Kent, 2009
Genome browsers, in short, aggregate data.
What do all of these have in common?
24
My rotifer transcriptome browser. It doesn’t have to be a genome
Not super exciting from this view. Just the predicted coding region of an assembled contig (mRNA)
You can do even more customization
25
All of this is in the conf
26
The relative ordering of things in a genome.
Just a few years ago, this was not available in GBrowse; it is now.
This could easily work for comparing different bacterial species
Synteny in GBrowse
27
GBrowse_syn on TAIR
28
It’s more interesting in WormBase
29
Are genome browsers useful?
30
We deal with huge volumes of data
The fall class will recall my hatred of GUIs
We want high-throughput
Genome browsers give you none of this. None.
We are bioinformaticists
31
I spent quite a bit of time in undergrad doing bench work for Dr. Nils Kroger across the street.
I worked with these little guys:
Fascinating creatures
I cared about three genes: Sil1, Sil2, Sil3
The day the genome browser came out changed the game
I wasn’t always a computer nerd
32
Still pretty useful
My main uses:
1. Make sure my data are correct. Are my intersections between genes and transposable element insertions correct?
2. Download hosted data.
3. Make nice pictures.
4. Like a biologist, get information about specific genes.
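The first use, checking gene/transposable element intersections, can be sketched in a few lines. This is a minimal illustration, not the method used in the lecture: the tuple layout `(contig, start, end, name)`, the function names, and the coordinates are all hypothetical, and intervals are assumed to be 1-based and inclusive, as in GFF.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two inclusive intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def intersect(genes, insertions):
    """Return (gene_name, insertion_name) pairs that overlap on the same contig."""
    hits = []
    for g_contig, g_start, g_end, g_name in genes:
        for t_contig, t_start, t_end, t_name in insertions:
            if g_contig == t_contig and overlaps(g_start, g_end, t_start, t_end):
                hits.append((g_name, t_name))
    return hits


# Hypothetical features: (contig, start, end, name)
genes = [("chr1", 100, 500, "geneA"), ("chr1", 900, 1200, "geneB")]
insertions = [("chr1", 450, 460, "TE1"), ("chr2", 950, 960, "TE2")]

hits = intersect(genes, insertions)
print(hits)
```

A pair reported here can then be looked up in the browser to confirm by eye that the two features really do overlap. The nested loop is fine for a spot check; a large dataset would call for sorted intervals or a dedicated tool.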
How useful is it for us?
33
How useful is it really?
It really depends on who you ask
It’s really for biologists: they find the browser, search for their favorite gene and get some details about it.
Once again, data aggregation.
In answer to the question
34
They were super excited about it
They use it all the time
It is like magic to them. If you were to show an iPhone to somebody from 1975, it would be pretty much the same thing. Almost.
The rotifer browser
35
Will it ever be the greatest genome browser? No. That will always be the UCSC browser.
Will it remain the easiest to install for some time? Probably.
Will you get the best return on time spent? Yep.
Synteny is horribly conserved in Haemophilus, so avoid GBrowse_syn for this class, but do keep it in mind.
Conclusion of GBrowse
36
Genome browsers:
Allow navigation of the genome
Show genomic features, whatever they are
Show annotations
Show comparisons
Just to make sure you’ve got it
37
GBrowse, and all of GMOD, use GFF files
Generic Feature Format
Most of it is pretty simple: chromosome (contig), start, stop, strand, id
The last column is what’s important. It lets you put whatever information about the feature you want in there.
It’s a very flexible format.
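To make the format concrete, here is a minimal sketch of reading one feature line, assuming the nine tab-separated columns of GFF3 (seqid, source, type, start, end, score, strand, phase, attributes). The function name and the example feature are hypothetical, and the parser ignores comment lines, directives, and multi-valued attributes.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict; the last column becomes a dict too."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols

    # The last column holds free-form key=value pairs separated by semicolons.
    attributes = {}
    for pair in attrs.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[key] = unquote(value)  # GFF3 percent-encodes special characters

    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


# Hypothetical feature, for illustration only
example = (
    "contig1\tprediction\tgene\t1200\t2450\t.\t+\t.\t"
    "ID=gene0001;Name=hypothetical_gene;Note=putative%20virulence%20factor"
)
feature = parse_gff3_line(example)
print(feature["seqid"], feature["start"], feature["end"], feature["strand"])
print(feature["attributes"])
```

Everything specific to the feature (its ID, a display name, notes, links) lives in the attributes column, which is why that column carries most of the customization.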
Database backends
38
Thanks for listening
Questions?