what's in a name? better vocabularies = better bioinformatics?

WHAT'S IN A NAME? Better vocabulary = better bioinformatics??? From flickr user giantginkgo # Author: Keith Bradnam, Genome Center, UC Davis # This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Upload: keith-bradnam

Post on 24-Jan-2015

Category:

Education


DESCRIPTION

Most of the pain and suffering that occurs in bioinformatics happens when database identifier 'A' in file 1 doesn't quite match database identifier 'B' in file 2...even when they are supposed to be the same identifier. Things don't always match up for a number of reasons, most of which *should* be under our control. This talk covers a few points relating to this and briefly discusses how we should all be using curated ontologies to describe our data.

TRANSCRIPT

Page 1: What's in a name? Better vocabularies = better bioinformatics?

WHAT'S IN A NAME?
Better vocabulary = better bioinformatics???

From flickr user giantginkgo

# Author: Keith Bradnam, Genome Center, UC Davis
# This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Page 2: What's in a name? Better vocabularies = better bioinformatics?

http://biomickwatson.wordpress.com

Most of the interesting 'stuff' that I discover about bioinformatics and genomics comes from a) Twitter, b) blogs, and c) papers (in that order). Mick Watson has a fun and engaging blog about bioinformatics and today he raised an important point: the lack of standardization in scientific databases leads to frustration (and frustration leads to...suffering).

Page 3: What's in a name? Better vocabularies = better bioinformatics?

http://biomickwatson.wordpress.com

These are some terms that appear in the same database. You can code solutions for some of this variation (e.g. British/American English differences or presence/absence of underscore vs space character), but who wants to waste time doing that? Shouldn't these databases be using controlled vocabularies?
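As a rough illustration of the kind of clean-up code the note above is talking about, here is a minimal Python sketch that maps free-text variants onto a small controlled vocabulary. The terms, the variant lists, and the `normalize` function are all made up for this example; a real database would take its terms from a curated ontology rather than a hand-written dictionary.

```python
import re

# Hypothetical controlled vocabulary: canonical term -> accepted spelling variants.
CONTROLLED_TERMS = {
    "tumor": {"tumor", "tumour"},
    "whole blood": {"whole blood", "wholeblood"},
}

# Reverse lookup table: every variant points back at its canonical term.
VARIANT_TO_TERM = {
    variant: term
    for term, variants in CONTROLLED_TERMS.items()
    for variant in variants
}


def normalize(raw):
    """Return the canonical term for a raw label, or None if it is not recognized."""
    # Lower-case, trim, and treat runs of underscores/whitespace as one space.
    key = re.sub(r"[_\s]+", " ", raw.strip().lower())
    return VARIANT_TO_TERM.get(key)


for label in ["Whole_Blood", "Tumour", "liver"]:
    print(label, "->", normalize(label))
```

Returning `None` for unknown labels (instead of guessing) means unexpected terms are flagged for a human to look at, not silently accepted.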

Page 4: What's in a name? Better vocabularies = better bioinformatics?

This infamous paper from 2004 reveals how easy it is to introduce errors into biological databases.

Page 5: What's in a name? Better vocabularies = better bioinformatics?

First highlighted column = actual gene name.
Second highlighted column = what Excel will automatically assume you mean.

Page 6: What's in a name? Better vocabularies = better bioinformatics?

RIKEN ID: 2310009E13

Happens for other identifiers as well. This RIKEN ID will change if it ever ends up in Excel...

Page 7: What's in a name? Better vocabularies = better bioinformatics?

RIKEN ID: 2.31E+13

...now it appears as a number in scientific notation.
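One way to guard against this is to scan identifiers before they go anywhere near a spreadsheet. The sketch below flags identifiers that look like dates or like numbers in scientific notation. The regular expressions are my own rough heuristics, not Excel's actual conversion rules, and `excel_risk` is a hypothetical helper name.

```python
import re

MONTH_ABBR = "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"

# Month-like prefix followed by a day-like number, e.g. DEC1, SEPT2, MARCH1.
DATE_LIKE = re.compile(rf"^(?:{MONTH_ABBR}|sept|march)-?\d{{1,2}}$", re.IGNORECASE)
# Digits, the letter E, then more digits, e.g. 2310009E13.
NUMBER_LIKE = re.compile(r"^\d+(?:\.\d+)?E[+-]?\d+$", re.IGNORECASE)
# Something that may already have been turned into a date, e.g. 2-Sep.
ALREADY_DATE = re.compile(rf"^\d{{1,2}}-(?:{MONTH_ABBR})$", re.IGNORECASE)


def excel_risk(identifier):
    """Return a short warning label for a risky identifier, or None if it looks safe."""
    if ALREADY_DATE.match(identifier):
        return "already-date"
    if DATE_LIKE.match(identifier):
        return "date-like"
    if NUMBER_LIKE.match(identifier):
        return "number-like"
    return None


for name in ["DEC1", "2310009E13", "2-Sep", "abu-11"]:
    print(name, "->", excel_risk(name))
```

A check like this only warns; the safer habit is to import identifier columns explicitly as text.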

Page 8: What's in a name? Better vocabularies = better bioinformatics?

The paper shows that these 'dates-as-gene-names' ended up propagating to other databases.

Page 9: What's in a name? Better vocabularies = better bioinformatics?

I searched today for '2-Sep' at GenBank and this was the only hit. It's possible that this is an intended gene-name variant, but Septin 2 is usually referred to as sep2/sept2/sep-2 etc. So this is possibly another Excel-based error.

Page 10: What's in a name? Better vocabularies = better bioinformatics?

Sometimes people make assumptions that gene names are unique to a specific function. DEC1 (one of the Excel-ified gene names mentioned in the earlier paper) can mean one thing to people working on many vertebrate species...

Page 11: What's in a name? Better vocabularies = better bioinformatics?

...but something else if you work on fruit flies. It is dangerous to make any assumptions when it comes to gene names.

Page 12: What's in a name? Better vocabularies = better bioinformatics?

Consider one worm gene...

Here is one Caenorhabditis elegans gene (abu-11) in WormBase. There is the official gene name, a sequence name, 'other' names, the WormBase gene ID, plus other identifiers for external databases which also describe the gene (there's also a protein ID, not shown here).

Page 13: What's in a name? Better vocabularies = better bioinformatics?

In C. elegans, gene names have a central naming authority (the CGC) but genes often get renamed. Just look at these pqn genes which have been renamed or merged with other genes.

Page 14: What's in a name? Better vocabularies = better bioinformatics?

This is the current view of the twk-43 gene in C. elegans (aka F32H5.7[abc]).

Page 15: What's in a name? Better vocabularies = better bioinformatics?

WormBase allows you to see the history behind genes. This gene started out as just F32H5.2, a gene with no splice isoforms.

Page 16: What's in a name? Better vocabularies = better bioinformatics?

Then at some point it was split into 3 genes...

Page 17: What's in a name? Better vocabularies = better bioinformatics?

...before being converted into the current one gene (with four splice isoforms). Genes are split and merged and renamed all the time. Relying on the common gene name (e.g. twk-43) or the sequence identifier (F32H5.7) can get you into trouble.
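The practical consequence is to key your own data on a stable database ID and treat gene names as labels that can go out of date. Here is a minimal sketch of that idea in Python. The stable ID `GENE0001` is invented for illustration (it is not a real WormBase ID), and the assignment of names to the 'current' and 'retired' sets just follows the history described in these notes.

```python
# Hypothetical lookup table: stable ID -> names in use now and names used in the past.
GENES = {
    "GENE0001": {
        "current": {"twk-43", "F32H5.7"},
        "retired": {"F32H5.2"},
    },
}


def resolve(name):
    """Map a gene name to (stable_id, status); status says how much to trust the name."""
    for gene_id, names in GENES.items():
        if name in names["current"]:
            return gene_id, "current"
        if name in names["retired"]:
            return gene_id, "retired"
    return None, "unknown"


for name in ["twk-43", "F32H5.2", "xyz-1"]:
    print(name, "->", resolve(name))
```

This toy version assumes a retired name points at exactly one gene. After a split, one old name can map to several genes, so real code would need to return a list and let the user decide.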

Page 18: What's in a name? Better vocabularies = better bioinformatics?

SOLUTIONS

What can be done to help with these sorts of problems?

Page 19: What's in a name? Better vocabularies = better bioinformatics?

Use ontologies and understand what those ontologies do.

Page 20: What's in a name? Better vocabularies = better bioinformatics?

Three main parts to a Gene Ontology term (GO term):
1) The name
2) The accession
3) The definition (which can change)

Page 21: What's in a name? Better vocabularies = better bioinformatics?

A fourth major part of a GO term is that it has ancestors and children. A single term can be 'part of' other terms and can also be an example of other terms (an 'is a' relationship). E.g. a nuclear outer membrane *is* a nuclear membrane and is *part of* the cell.
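To make the ancestor idea concrete, here is a toy Python sketch that walks up a tiny term graph. The graph only contains the example from this slide plus one edge that I have assumed (nuclear membrane 'part_of' cell); it uses term names and not real GO accessions, and `ancestors` is a hypothetical helper, not part of any GO library.

```python
# Toy graph: term -> list of (relationship, parent term) edges.
PARENTS = {
    "nuclear outer membrane": [
        ("is_a", "nuclear membrane"),
        ("part_of", "cell"),
    ],
    "nuclear membrane": [("part_of", "cell")],  # assumed edge, for illustration only
    "cell": [],
}


def ancestors(term):
    """Collect every term reachable by following is_a / part_of edges upwards."""
    found = []
    to_visit = [term]
    while to_visit:
        current = to_visit.pop()
        for _relationship, parent in PARENTS.get(current, []):
            if parent not in found:
                found.append(parent)
                to_visit.append(parent)
    return found


print(ancestors("nuclear outer membrane"))
```

This is why a gene annotated with a very specific term will also turn up when you search with one of that term's broader ancestors.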

Page 22: What's in a name? Better vocabularies = better bioinformatics?

Most model organism databases are loaded up with GO terms. E.g. you can search GO terms from the 'front door' of FlyBase.

Page 23: What's in a name? Better vocabularies = better bioinformatics?

In WormBase, the same GO term search takes you directly to a gene page.

Page 24: What's in a name? Better vocabularies = better bioinformatics?

Scroll down on that gene page and we see the specified GO term...but what is an 'evidence code', and what does 'IDA' mean?

Sadly, the majority of people who use GO terms (as part of 'DAVID' analyses etc.) have no knowledge of evidence codes.

Page 25: What's in a name? Better vocabularies = better bioinformatics?

All GO terms should be connected to genes (or other database entries) with evidence codes. These give you an idea of how robust the assignment is. Databases like WormBase have curators that scan papers (by eye, but also with software) to find suitable GO terms that can be added to genes on the basis of experiments described in the paper.

Page 26: What's in a name? Better vocabularies = better bioinformatics?

Most of the GO terms you will ever see have this evidence code: IEA (Inferred from Electronic Annotation). It is among the weakest of all evidence (also avoid any evidence which is 'non-traceable author statement'). It could simply mean that a human protein (with some known information) was BLASTed against a yeast genome and the resulting yeast match acquired the human meta-information as GO terms. IEA codes should be treated with some suspicion.
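If you have annotations in a table, separating the weakly supported ones takes only a few lines. The sketch below uses made-up rows (gene, GO term name, evidence code); only the evidence codes themselves (IDA, IEA, NAS) are real. Which codes count as 'weak' is a choice I have made for this example, based on the two codes singled out above.

```python
# Made-up annotation rows: (gene, GO term name, evidence code).
ANNOTATIONS = [
    ("geneA", "term X", "IDA"),  # IDA: Inferred from Direct Assay
    ("geneA", "term Y", "IEA"),  # IEA: Inferred from Electronic Annotation
    ("geneB", "term X", "IEA"),
    ("geneB", "term Z", "NAS"),  # NAS: Non-traceable Author Statement
]

# Assumption for this sketch: treat these two codes as weak evidence.
WEAK_CODES = {"IEA", "NAS"}


def split_by_evidence(annotations, weak_codes=WEAK_CODES):
    """Return (stronger, weaker) lists of annotations based on their evidence code."""
    stronger = [row for row in annotations if row[2] not in weak_codes]
    weaker = [row for row in annotations if row[2] in weak_codes]
    return stronger, weaker


def fraction_electronic(annotations):
    """Fraction of annotations that carry the IEA evidence code."""
    if not annotations:
        return 0.0
    electronic = sum(1 for row in annotations if row[2] == "IEA")
    return electronic / len(annotations)


stronger, weaker = split_by_evidence(ANNOTATIONS)
print(len(stronger), "stronger;", len(weaker), "weaker")
print("IEA fraction:", fraction_electronic(ANNOTATIONS))
```

Running an enrichment analysis with and without the weaker rows is a cheap way to see how much a result depends on automatic annotation.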

Page 27: What's in a name? Better vocabularies = better bioinformatics?

48.2% of GO annotations (in one of the best annotated eukaryotic animal genomes) are generated automatically

The Gene Ontology website shows how many GO terms are attached to genes in different organisms. Even in C. elegans (with >15 years of gene annotation), about half of the GO annotations are in the IEA category.

Page 28: What's in a name? Better vocabularies = better bioinformatics?

Gene Ontology is not the only game in town. Sequence Ontology (SO) is widely used and a subset of SO terms are used in GFF files to describe features (or at least they should be!).
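A quick sanity check for a GFF file is to look at the feature type in the third column and compare it against the SO terms you expect. The sketch below does this for a couple of invented GFF lines. The allowed set is a small hand-picked list of SO term names that I chose for the example, not the full Sequence Ontology, and `unknown_feature_types` is a hypothetical helper.

```python
# Small hand-picked set of SO term names (an assumption, not the full ontology).
ALLOWED_TYPES = {"gene", "mRNA", "exon", "CDS", "five_prime_UTR", "three_prime_UTR"}

# Invented tab-delimited example lines; column 3 holds the feature type.
EXAMPLE_LINES = [
    "##gff-version 3",
    "chrI\texample\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chrI\texample\tCoding Exon\t100\t300\t.\t+\t0\tParent=gene1",
    "chrI\texample\tCDS\t400\t900\t.\t+\t1\tParent=gene1",
]


def unknown_feature_types(gff_lines, allowed=ALLOWED_TYPES):
    """Return the set of feature types (column 3) that are not in the allowed set."""
    unknown = set()
    for line in gff_lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 3:
            continue
        if columns[2] not in allowed:
            unknown.add(columns[2])
    return unknown


print(unknown_feature_types(EXAMPLE_LINES))
```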

Page 29: What's in a name? Better vocabularies = better bioinformatics?

GO and SO are part of OBO (Open Biological Ontologies: http://www.obofoundry.org). There may be a community developing an ontology for your field of interest. This site lists them all.

Page 30: What's in a name? Better vocabularies = better bioinformatics?

Some get very specific.

Page 31: What's in a name? Better vocabularies = better bioinformatics?

SUMMARY

Page 32: What's in a name? Better vocabularies = better bioinformatics?

Use ontologies whenever possible

Don't assume that identifiers in existing databases are the correct (or only) identifiers

Be careful when inflicting new database identifiers on to the world!

On the last point, check that your identifiers (even if they end up buried in supplementary material somewhere) don't conflict with other databases out there. Long and boring identifiers are usually the most stable and more easily parsed by scripts (although they are the least human-friendly). But no spaces or asterisks in identifiers please!

This talk is KORF_labtalk_00000315
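As a small illustration of 'long and boring', here is a Python sketch that builds zero-padded identifiers in the same style as the talk ID and rejects identifiers containing spaces, asterisks, or other awkward characters. The allowed character set and both function names are my own assumptions; pick whatever rules suit your database, then stick to them.

```python
import re

# Assumption for this sketch: letters, digits, underscore, dot, colon, and hyphen only.
SAFE_ID = re.compile(r"[A-Za-z0-9_.:-]+")


def is_safe_identifier(identifier):
    """True if the identifier uses only the allowed characters (no spaces, no asterisks)."""
    return SAFE_ID.fullmatch(identifier) is not None


def make_identifier(prefix, number, width=8):
    """Build a fixed-width, zero-padded identifier such as PREFIX_00000315."""
    return f"{prefix}_{number:0{width}d}"


talk_id = make_identifier("KORF_labtalk", 315)
print(talk_id, is_safe_identifier(talk_id))
print(is_safe_identifier("my gene*"))
```

Fixed-width numbering keeps identifiers sortable as plain text and makes them easy to match with a simple pattern.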