grandrounds2004.ppt

Post on 10-May-2015

Mining Medical Mountains: How Bioinformatics Can Help

Medical Science

David Wishart

University of Alberta

The Library of Congress

• 120 million items in storage

• 54 million manuscripts

• 18 million books

• 12 million photographs

• 4.5 million maps

• 4.4 million technical reports

• 1.1 million PhD dissertations

• ~20 Terabytes of data

Some Numbers…

• 3 scientific journals in 1750

• 120,000 scientific journals today

• 500,000 medical articles/year

• 4,000,000 scientific articles/year

• 14,000,000 abstracts in PubMed derived from 4600 journals

• 3,307,998,701 web pages on Google

• 500,000,000,000,000 bytes on the Web

Some Numbers…

• A researcher would have to scan 130 different journals and read 27 papers per day to follow a single disease, such as breast cancer.

• Baasiri, R.A., Glasser, S.R., Steffen, D.L. & Wheeler, D.A. Oncogene 18, 7958-7965 (1999)

Some Graphs:

Multiplexed CE with Fluorescent detection

ABI 3700 96x700 bases

Genomes

• 5 vertebrates (human, mouse, rat, fugu)

• 2 plants (Arabidopsis, rice)

• 2 insects (fruit fly, mosquito)

• 2 nematodes (C. elegans, C. briggsae)

• 1 sea squirt

• 4 parasites (plasmodium, guillardia)

• 4 fungi (S. cerevisiae, S. pombe)

• 140 bacteria and archaebacteria

• 1000+ viruses

The Human Genome

• 3.2 billion bases on 24 chromosomes

• 3,201,762,515 bases sequenced (99%)

• 23,531 - 31,609 genes (predicted)

• 50,000+ named genes (synonyms)

• 4000+ human diseases

• 850-1039 disease-causing genes (identified)

A Tidal Wave of Data

Made worse by….

The Language of Biology

• The EGF receptor binds epidermal growth factor which triggers the phosphorylation of PLC-gamma followed by the binding and subsequent phosphorylation of Grb2 and SOS which leads to the formation of a Raf1-MEK complex which, in turn, leads to a p21ras auto-phosphorylation cascade. The complex then phosphorylates a MAP kinase which is transported to the nucleus via a nuclear transport signal which triggers the transcription of c-Fos, c-Myc and c-Jun which upon release in the rough ER are transported to…

How To Make Sense of This?

• How to acquire biological or medical knowledge from English text?

• How to build facts and relationships from scientific/medical articles?

• How to put 100+ years of useful data into readily accessible electronic repositories (the back fill problem)?

Some Solutions

• Text Mining…

• Create electronic repositories of abstracts and articles (PubMed/Entrez)

• Create glossaries & thesauri of terms

• Employ machine learning methods to parse electronic text to extract or interpret key pieces of “atomic” information (SVM, Naïve Bayes, Reference Point Logistics, etc.)
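As a rough illustration of the machine learning bullet above, here is a minimal Naïve Bayes sketch for labelling sentences. The training sentences, the two labels and all function names are hypothetical, chosen only for this example; a real system would train on thousands of curated abstracts with a proper tokenizer.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (sentence, label)
TRAINING = [
    ("grb2 binds sos", "interaction"),
    ("egf receptor binds egf", "interaction"),
    ("raf1 phosphorylates mek", "interaction"),
    ("patients were surveyed by mail", "other"),
    ("the survey covered residency programs", "other"),
    ("dermatology faculty were surveyed", "other"),
]

def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()

def train_naive_bayes(labeled_docs):
    """Count documents per class and words per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab

def classify(text, model):
    """Return the label with the highest log-posterior (Laplace smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_naive_bayes(TRAINING)
print(classify("sos binds raf1", model))
```

The same counting scheme extends to any set of labels; only the training data changes.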

PubMed

http://www.ncbi.nlm.nih.gov/PubMed/

PubMed

• Allows users to search by journal, keywords, titles, etc.

• Uses MeSH (Medical Subject Headings) to allow automated searching of synonyms (renal transplant = kidney transplantation)

• API available to query PubMed automatically and remotely

• Few users know how to use PubMed properly or to its full extent

"ouellette bf"[au] AND yeast

Details

MeSH: Medical Subject Heading

("ouellette bf"[au] AND (("yeasts"[MeSH Terms] OR "saccharomyces cerevisiae"[MeSH Terms]) OR yeast[Text Word]))
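A query like the one above can also be submitted programmatically. The sketch below only builds the request URL, assuming NCBI's E-utilities ESearch endpoint and its `db`, `term`, `retmax` and `retmode` parameters; the helper name `build_esearch_url` is made up for this example, and no network call is made.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term, retmax=20):
    """Build an ESearch URL for a PubMed query string."""
    params = {
        "db": "pubmed",      # search the PubMed database
        "term": term,        # the query, in normal PubMed syntax
        "retmax": retmax,    # maximum number of IDs to return
        "retmode": "json",   # ask for a JSON response
    }
    return ESEARCH + "?" + urlencode(params)

query = '"ouellette bf"[au] AND yeast'
url = build_esearch_url(query)
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return the matching PubMed IDs; PubMed applies the MeSH expansion shown under Details on the server side.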

Integrated Text/Sequence Searching with Entrez

PubCrawler

http://www.pubcrawler.ie/

PubCrawler

• Free "alerting" service that scans daily updates to the NCBI Medline (PubMed) and GenBank databases

• Lists new database entries that match search parameters (keywords, author names, etc.) specified by the user

• Results are presented as an HTML Web page (Entrez-like format)

• Can be downloaded or run as a service
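The alerting idea described above can be sketched in a few lines: keep a set of record IDs already reported, then list only new records whose titles match the user's keywords and render them as HTML. This is not PubCrawler's code; the record format, IDs and function names are hypothetical.

```python
from html import escape

# Hypothetical daily update: each record has an ID and a title
ENTRIES = [
    {"id": "id-001", "title": "Yeast two-hybrid screening revisited"},
    {"id": "id-002", "title": "A yeast model of prion propagation"},
    {"id": "id-003", "title": "Pediatric dermatology workforce survey"},
]

def new_matching_entries(entries, seen_ids, keywords):
    """Return unseen entries whose title contains any keyword."""
    lowered = [k.lower() for k in keywords]
    hits = []
    for entry in entries:
        if entry["id"] in seen_ids:
            continue  # already reported on an earlier run
        title = entry["title"].lower()
        if any(k in title for k in lowered):
            hits.append(entry)
    return hits

def render_html(hits):
    """Render the hits as a simple HTML list."""
    items = "".join(f"<li>{escape(h['title'])}</li>" for h in hits)
    return f"<ul>{items}</ul>"

hits = new_matching_entries(ENTRIES, seen_ids={"id-001"}, keywords=["yeast"])
page = render_html(hits)
print(page)
```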

MedMiner

http://discover.nci.nih.gov/textmining/filters.html

MedMiner

• A text miner that filters, extracts and organizes relevant sentences in the literature based on a gene, gene-gene or gene-drug query

• Combines GeneCards and PubMed searches with an integrated text filter

• L. Tanabe, U. Scherf, L. H. Smith, J. K. Lee, L. Hunter and J. N. Weinstein, (1999) BioTechniques 27:1210-1217.

MedGene

http://hipseq.med.harvard.edu/MEDGENE/login.jsp

MedGene

• A list of human genes associated with a particular human disease in ranking order

• A list of human genes associated with multiple human diseases in ranking order

• A list of human diseases associated with a particular human gene in ranking order

• A list of human genes associated with a particular human gene in ranking order

• The sorted gene list from other disease related high-throughput experiments, (i.e. micro-array

MedGene Performance

• Was able to identify >2400 genes associated with breast cancer in the literature

• Existing databases only list 260 genes (of which MedGene found 240)

• Could save hundreds of hours of literature searching & combing
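To make the idea of a ranked gene-disease list concrete, the sketch below ranks genes by how many abstracts mention them together with a disease. Plain co-occurrence counting is an assumption made for illustration, not MedGene's actual statistic, and the abstracts and function name are invented for the example.

```python
from collections import Counter

# Hypothetical one-sentence "abstracts"
ABSTRACTS = [
    "BRCA1 mutations are frequent in breast cancer.",
    "BRCA1 and TP53 are both altered in breast cancer.",
    "TP53 is mutated in colon cancer.",
]

def rank_genes_for_disease(abstracts, disease, gene_names):
    """Rank genes by the number of abstracts that mention both gene and disease."""
    counts = Counter()
    disease = disease.lower()
    for text in abstracts:
        lowered = text.lower()
        if disease not in lowered:
            continue  # abstract does not mention the disease
        # Crude word split; strip the punctuation used in the toy data
        words = set(lowered.replace(",", " ").replace(".", " ").split())
        for gene in gene_names:
            if gene.lower() in words:
                counts[gene] += 1
    return counts.most_common()  # highest co-occurrence first

ranking = rank_genes_for_disease(ABSTRACTS, "breast cancer", ["BRCA1", "TP53", "EGFR"])
print(ranking)
```

Genes that never co-occur with the disease (EGFR here) are simply absent from the ranking.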

PolySearch

PolySearch

• Searches over 14 million PubMed Records

• Searches against 1622 diseases (and synonyms)

• Searches using 9300 genes with 42,500 synonyms

• Assesses quality using SCI list of impact factors for 8600+ journals

PolySearch

• Supports PubMed text searching for gene & disease associations (user provides disease name)

• Automatically scores & identifies genes and searches for known SNPs or mutations against standard databases

• Grabs gene sequences and generates primers around SNPs

• Archives (MySQL database) or sends results as HTML page to user

Other Examples of Text or Web Mining

http://textomy.iit.nrc.ca/

Pre-BIND

• Donaldson et al. BMC Bioinformatics 2003 4:11

• Used Support Vector Machine (SVM) to scan literature for protein interactions

• Precision, accuracy and recall of 92% for correctly classifying PI abstracts

• Estimated to capture 60% of all abstracted protein interactions for a given organism
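For readers unfamiliar with the metrics quoted above, the sketch below shows how precision, recall and accuracy are computed from a classifier's confusion counts. The counts are hypothetical, picked so that all three come out at 92%; they are not taken from the Pre-BIND paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall and accuracy from confusion-matrix counts."""
    return {
        # Of the abstracts flagged as interaction papers, how many really were
        "precision": tp / (tp + fp),
        # Of the true interaction papers, how many were flagged
        "recall": tp / (tp + fn),
        # Fraction of all abstracts classified correctly
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical counts for 200 abstracts, 100 of which describe interactions
metrics = classification_metrics(tp=92, fp=8, fn=8, tn=92)
print(metrics)
```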

Proteome Analyst

• Uses Naïve Bayes methods in combination with sequence homology to identify “tokens” or nuggets of important information from text (titles, keywords, InterPro numbers and other data)

• Produces quantitative estimates (queryable reliability scores) of protein function, location, etc.

GenePublisher

• Processes raw genechip data and produces a publishable report in 1-2 hours of processor time

• Mines existing databases to build up or extract relationships

• Learns from previous analyses and remembers previous associations

http://www.cbs.dtu.dk/services/GenePublisher/

GenePublisher Output

Continuing Problems in Text Mining Biomedical Literature are…

A Serious Naming Problem

• Sonic Hedgehog

• Draculin

• Profilactin

• Knobhead

• Lunatic Fringe

• Fidgetin

• Mortalin

• Antiquitin

• Accelerin

• Cockeye

• Clootie Dumpling

• SnaFu

• Gleeful

• Bang Senseless

• Bride of Sevenless

• Crack

• Christmas Factor

• Orphanin

And Exotic Terminology…

• J. Med. Genetics 10, 1962-6 (1973) "Mobius Syndrome with Poland’s Anomaly."

• Heavy use of eponyms (Werner’s syndrome, Down’s syndrome, Angelman’s syndrome, Creutzfeldt-Jakob disease, etc.)

Some Challenges

• How to name or describe proteins, genes, drugs, diseases and conditions consistently and coherently?

• How to ascribe and name a function, process or location consistently?

• How to describe interactions, partners, reactions and complexes?

• How to classify genes & proteins (a universal taxonomy of sequences and structures)?

Some Solutions

• Develop controlled or restricted vocabularies (IUPAC-like naming conventions)

• Create thesauri, central repositories or synonym lists (MeSH terms in PubMed)

• Work towards synoptic reporting and structured abstracting
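A synonym list of the kind proposed above can be as simple as a lookup table that maps every variant to one preferred term. The sketch below assumes a tiny hand-made table; the entries are illustrative and are not copied from MeSH, and the function names are hypothetical.

```python
# Illustrative thesaurus: preferred term -> known variants
SYNONYMS = {
    "kidney transplantation": ["renal transplant", "kidney transplant"],
    "Down syndrome": ["Down's syndrome", "trisomy 21"],
}

def build_lookup(synonyms):
    """Map every lowercased variant (and the preferred term itself) to the preferred term."""
    lookup = {}
    for preferred, variants in synonyms.items():
        lookup[preferred.lower()] = preferred
        for variant in variants:
            lookup[variant.lower()] = preferred
    return lookup

def normalize(term, lookup):
    """Return the preferred term, or the input unchanged if it is unknown."""
    return lookup.get(term.strip().lower(), term)

lookup = build_lookup(SYNONYMS)
print(normalize("Renal Transplant", lookup))
```

Normalizing terms before indexing or searching is what lets a query for one variant retrieve documents that use another.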

Synoptic or Structured Abstract

J Am Acad Dermatol. 2004 Mar;50(3):431-4.

Demand outstrips supply of US pediatric dermatologists: Results from a national survey.

Hester EJ, McNealy KM, Kelloff JN, Diaz PH, Weston WL, Morelli JG, Dellavalle RP.

BACKGROUND: The US pediatric dermatology workforce was last examined in 1986 when limited employment opportunity was found. OBJECTIVE: We sought to re-examine pediatric dermatology workforce issues. METHODS: US dermatology chairpersons and residency program directors were surveyed for: (1) agreement with pediatric dermatology workforce statements; and (2) pediatric dermatology faculty and fellow numbers. RESULTS: Respondents agreed that having a pediatric dermatologist or dermatologists on faculty is important, and that a shortage of pediatric dermatologists exists, but did not agree that increasing pediatric dermatology training requirements will increase this shortage. Almost half of the programs (45/94) employed a full-time pediatric dermatologist, and 24 programs had currently been recruiting a pediatric dermatologist for more than 1 year. Only 6 pediatric dermatology fellows were in training. CONCLUSION: Given that open pediatric dermatology faculty positions greatly exceed the number of fellows in training and that formal training requirements will be increasing, the shortage of pediatric dermatologists will likely continue.

GO-Gene Ontology

• To produce a controlled vocabulary that changes as biological knowledge changes

• Categorizes according to 1) molecular function; 2) biological process; and 3) cellular component

• Represents contributions and consensus opinions from multiple experts in various fields

• Aim is to have every known protein and gene annotated consistently

http://www.geneontology.org/
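The three GO categories map naturally onto a small data structure. The sketch below is one possible representation, not GO's own file format: the gene name "GENE_X" is hypothetical, the class and function names are made up, and the three terms are used purely as illustrations.

```python
from dataclasses import dataclass
from enum import Enum

class Aspect(Enum):
    """The three top-level GO categories."""
    MOLECULAR_FUNCTION = "molecular function"
    BIOLOGICAL_PROCESS = "biological process"
    CELLULAR_COMPONENT = "cellular component"

@dataclass(frozen=True)
class Annotation:
    gene: str        # gene or protein being annotated
    term_id: str     # GO identifier
    term_name: str   # human-readable term name
    aspect: Aspect   # which of the three categories the term belongs to

def group_by_aspect(annotations):
    """Collect term names under each of the three aspects."""
    grouped = {aspect: [] for aspect in Aspect}
    for ann in annotations:
        grouped[ann.aspect].append(ann.term_name)
    return grouped

# Illustrative annotations for a hypothetical gene
ANNOTATIONS = [
    Annotation("GENE_X", "GO:0003677", "DNA binding", Aspect.MOLECULAR_FUNCTION),
    Annotation("GENE_X", "GO:0006915", "apoptotic process", Aspect.BIOLOGICAL_PROCESS),
    Annotation("GENE_X", "GO:0005634", "nucleus", Aspect.CELLULAR_COMPONENT),
]

grouped = group_by_aspect(ANNOTATIONS)
print(grouped[Aspect.CELLULAR_COMPONENT])
```

Because every annotation carries a controlled term ID rather than free text, two groups annotating the same gene can be compared directly.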

NIH’s Medical Ontology Research Program

http://lhncbc.nlm.nih.gov/lhc/servlet/Turbine/template/home%2CHome.vm

MeSH

OMIM

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

DrugBank

http://redpoll.pharmacy.ualberta.ca/drugbank/

Bioinformatics

Medinformatics

Conquering the Mountain
