interpopula


Upload: tiago

Post on 11-Nov-2014

Page 1: interPopula

interPopula: Database and tool integration for population genetics

With a focus on the HapMap project

Tiago Rodrigues Antão

http://popgen.eu/soft/interPop

[email protected]

Liverpool School of Tropical Medicine, UK

interPopula – p. 1

Page 2: interPopula

Preamble – the HapMap project (and UCSC Known Genes)


Page 3: interPopula

HapMap

The goal of the International HapMap Project is to develop a haplotype map of the human genome, the HapMap, which will describe the common patterns of human DNA sequence variation. The HapMap is expected to be a key resource for researchers to use to find genes affecting health, disease, and responses to drugs and environmental factors. The information produced by the Project will be made freely available.

http://hapmap.ncbi.nlm.nih.gov/


Page 4: interPopula

What is there?

11 pops, 90–180 individuals/pop (some cases with family trios), >3M SNPs

Frequencies (e.g. for population P and SNP S, there are 30% of As and 70% of Cs)

Genotypes (data per individual)

Phasing data

Pedigree info

LD (linkage disequilibrium) computations

Copy Number Variation (CNV) info – New!

A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851-861. 2007.


Page 5: interPopula

UCSC Known Genes

A gene set constructed by an automated process, based on protein data from Swiss-Prot/TrEMBL (UniProt) and the associated mRNA data from GenBank

Inside the UCSC Genome Browser: http://genome.ucsc.edu/

Not only for humans (but options are limited, fewer than a handful of species)

Really useful for HapMap data (makes it much easier to relate SNPs to gene information than Entrez SNP)

Hsu et al, Bioinformatics, 2006 22(9):1036-1046 (but see Genome Browser updates on NAR)


Page 6: interPopula

We now return to our regularly scheduled program – interPopula


Page 7: interPopula

Introduction – 1

A Python library to access HapMap and UCSC Known Genes data

A set of scripts providing integration examples: integrating interPopula with Biopython, matplotlib, Genepop and Entrez SNP. Interaction with the ecology of PopGen databases and Python tools is encouraged

A set of guidelines to deal with inconsistencies across databases

Very easy to use, many examples

For Perl: Ensembl Variation API (Rios et al. BMC Bioinformatics 2010, 11:238)


Page 8: interPopula

Introduction – 2

Python (2.6) based. Test coverage very high

Uses sqlite (Python built-in, no extra dependencies)

Creates a local SQL database from ftp data files

Can be disk and network intensive

Intelligent download: on-demand and never fetches the same data twice

Database not normalized (for performance and space reasons)

Family support (triage of offspring)

Data export (Genepop). X and Y aware.
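
The on-demand, never-twice download into a local SQLite database can be sketched roughly as follows. This is only an illustration of the idea, not interPopula's actual code: the class name `FrequencyCache`, the table layout, and the `fetch` callback (standing in for the real FTP download and parsing) are all assumptions.

```python
import sqlite3


class FrequencyCache:
    """Hypothetical on-demand cache; not part of interPopula."""

    def __init__(self, fetch, path=":memory:"):
        # fetch(chrom, pop) stands in for the FTP download; it is assumed
        # to return an iterable of (rs, position, allele1_frequency) rows.
        self.fetch = fetch
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS loaded ("
            "chrom TEXT, pop TEXT, PRIMARY KEY (chrom, pop))")
        # Deliberately flat (not normalized): one wide row per SNP.
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS freq ("
            "chrom TEXT, pop TEXT, rs TEXT, pos INTEGER, allele1_freq REAL)")

    def require(self, chrom, pop):
        """Load data for (chrom, pop) once; return True if a fetch happened."""
        cur = self.db.execute(
            "SELECT 1 FROM loaded WHERE chrom = ? AND pop = ?", (chrom, pop))
        if cur.fetchone() is not None:
            return False  # already stored locally, nothing to download
        rows = [(chrom, pop, rs, pos, f) for rs, pos, f in self.fetch(chrom, pop)]
        with self.db:  # commit the data and the bookkeeping row together
            self.db.executemany("INSERT INTO freq VALUES (?, ?, ?, ?, ?)", rows)
            self.db.execute("INSERT INTO loaded VALUES (?, ?)", (chrom, pop))
        return True

    def rs_in_interval(self, chrom, pop, start, end):
        """Return rs identifiers with start <= position <= end, by position."""
        cur = self.db.execute(
            "SELECT rs FROM freq WHERE chrom = ? AND pop = ? "
            "AND pos BETWEEN ? AND ? ORDER BY pos",
            (chrom, pop, start, end))
        return [row[0] for row in cur]
```

Recording each loaded (chromosome, population) pair in its own table is one simple way to make repeated requests free; passing a file path instead of `:memory:` would keep the data across runs.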


Page 9: interPopula

HapMap example

To get a feel for the interface...

    # chr, pop, startPos and endPos are placeholders supplied by the caller
    freqDB = Frequency()
    # Request the data for this chromosome and population
    freqDB.requireChrPop(chr, pop)
    RSs = freqDB.getRSsForInterval(chr, startPos, endPos)
    for rs in RSs:
        # We get frequency information
        freqSNP = freqDB.getPopSNPs(pop, rs)
        nuc1, nuc2 = freqSNP[5], freqSNP[6]
        a1a1, a2a2, a1a2 = \
            freqSNP[7], freqSNP[8], freqSNP[9]
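
As a follow-up to the snippet above, genotype information of this kind can be turned into allele frequencies. A minimal sketch, assuming that `a1a1`, `a2a2` and `a1a2` are counts of individuals carrying each genotype (the slide does not say what the fields hold); the helper `allele_frequencies` is hypothetical and not part of interPopula.

```python
def allele_frequencies(a1a1, a2a2, a1a2):
    """Return (frequency of allele 1, frequency of allele 2).

    Assumes the arguments are genotype counts for a diploid locus:
    homozygotes for allele 1, homozygotes for allele 2, heterozygotes.
    """
    total_alleles = 2 * (a1a1 + a2a2 + a1a2)
    if total_alleles == 0:
        raise ValueError("no genotyped individuals")
    # Each allele-1 homozygote carries two copies, each heterozygote one.
    allele1_copies = 2 * a1a1 + a1a2
    p = allele1_copies / total_alleles
    return p, 1.0 - p
```

With 3 homozygotes for the first allele, 7 for the second and no heterozygotes, this gives the 30%/70% split used as an example earlier in the talk.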


Page 10: interPopula

UCSC Known Genes support

Everything is supported (not that much, just a long text file plus a link table)

Get different IDs (Accession ID, Prot ID, other links)

What is near a certain genomic position (chromosome and position in chromosome)

Get exons for a certain gene
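
The two positional queries above (what is near a position, and the exons of a gene) can be sketched over plain records. This is not the interPopula interface: the `Gene` record, its field names, and the functions `genes_near` and `exons` are assumptions made for illustration, loosely modelled on a table with transcript and exon start/end coordinates.

```python
from collections import namedtuple

# Hypothetical record; the field names are assumptions, not interPopula's.
Gene = namedtuple("Gene", "name chrom tx_start tx_end exon_starts exon_ends")


def genes_near(genes, chrom, pos, window=10000):
    """Return genes on chrom whose span, padded by window, contains pos."""
    return [g for g in genes
            if g.chrom == chrom
            and g.tx_start - window <= pos <= g.tx_end + window]


def exons(gene):
    """Return the gene's exons as a list of (start, end) pairs."""
    return list(zip(gene.exon_starts, gene.exon_ends))
```

A linear scan is enough for a sketch; a real implementation over a whole genome would more likely push the range test into an indexed SQL query.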


Page 11: interPopula

Integration

Many examples provided on interoperability (with matplotlib, Entrez SNP, Genepop and Biopython)

Integrating heterogeneous databases

Databases do use different reference assemblies

Example: the exon positions given by the latest version of the UCSC Table Browser are not compatible with HapMap (v37 vs v36)

Silent bug: applications rarely crash and results seem correct

This issue is discussed in the context of HapMap/TableBrowser/EntrezSNP and might be useful in other cases
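
One way to stop an assembly mismatch from staying silent is to carry the assembly version along with every coordinate and refuse to compare across versions. The sketch below only illustrates that idea and is not taken from the interPopula guidelines; the record types, the exception and the function name are all hypothetical.

```python
from collections import namedtuple

# Hypothetical record types; interPopula's own data structures may differ.
Interval = namedtuple("Interval", "assembly chrom start end")
Snp = namedtuple("Snp", "rs assembly chrom pos")


class AssemblyMismatch(ValueError):
    """Raised when coordinates from different reference assemblies are mixed."""


def snps_in_interval(snps, interval):
    """Return rs identifiers of SNPs inside interval, failing loudly on
    any SNP whose coordinates come from a different assembly."""
    hits = []
    for snp in snps:
        if snp.assembly != interval.assembly:
            raise AssemblyMismatch(
                "%s is on assembly %s but the interval is on %s"
                % (snp.rs, snp.assembly, interval.assembly))
        if snp.chrom == interval.chrom and interval.start <= snp.pos <= interval.end:
            hits.append(snp.rs)
    return hits
```

Comparing a v36 SNP position against a v37 exon then raises an error instead of quietly returning a plausible but wrong list.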


Page 12: interPopula

Examples – Known Genes


Page 13: interPopula

Examples – HapMap/Integration


Page 14: interPopula

Future work

Focus on HapMap and maybe 1000 Genomes project

The whole UCSC Table Browser support will be spun off later into a different project

Copy Number Variation support (since June on HapMap)

Phasing support due very soon (like next week)

Provide examples with genome-wide association studies
