Informatics Infrastructure at the start of the Second Decade of DNA Barcoding

Sujeevan Ratnasingham

Biodiversity Institute of Ontario, University of Guelph

Upload: sratnasi

Post on 11-Feb-2017



10+ Years, 4M+ Records, 0.5M+ Species, 100+ Nations

Decade 1 - Capacity Building

Community Database

Collaborative Networks

Software Tools

Data Standards

Data Standards

Complete

Locus

Quality

Provenance

Building the Library

[Figure: growth of the library by year, 2004 to 2014; vertical axis from 10,000 to 10,000,000; series: barcodes, Linnean species, BINs]

Community Benchmarks

Benchmark types: Taxonomic, Thematic, Geographic

Examples shown: Spiders, Birds, Lepidoptera of North America, Birds of Argentina, IUCN Red List, CITES, Fish, Bees, Amphibians, Mammals, Mosquitoes

Collaborative Networks

[Figure: Registered Users (Thousands), axis 0 to 16, by year, 2004 to 2014]

Data Sharing

2005 – 102 users from 30 institutions

1000+ Institutions from 94 countries sharing data on BOLD

BOLD User Network - 2015

[Figure: data sharing Across Nations versus Within Nations; legend: 100K+, 10K – 100K, 1K – 10K]

Canada, France, USA, Germany, Costa Rica, United Kingdom, Switzerland

Finland, Kenya, Belgium, Madagascar, Norway, Austria, Sweden, Jordan, Argentina, China, Brazil, Spain, Mexico, Panama, Portugal, Pakistan, Egypt, South Africa, India, New Zealand, Netherlands

66 Other Countries from Every Continent

Testing the Library Depth

BBC Tree of Life, 2014

400K+ Species

163K+ with Full Taxonomy

80% of the BOLD Library

500+ Orders

Animals

Test Data: 4,000 species from 200 orders, 20 per order

Reference

Queries
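The test-data design on this slide (200 orders, 20 species per order, 4,000 species in total) amounts to a stratified sample. A minimal sketch of one way to draw such a sample follows; the function name, the `(species, order)` record shape, the fixed seed, and the choice to skip orders with fewer than 20 species are all assumptions made for illustration, not the procedure used in the talk.

```python
import random
from collections import defaultdict

def sample_per_order(records, per_order=20, seed=0):
    """Draw `per_order` distinct species from every order that has enough.

    `records` is an iterable of (species, order) pairs. This shape is
    hypothetical; it is not a BOLD export format.
    """
    by_order = defaultdict(set)
    for species, order in records:
        by_order[order].add(species)

    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sample = {}
    for order, species in sorted(by_order.items()):
        # Assumption: orders with too few species are left out entirely.
        if len(species) >= per_order:
            sample[order] = rng.sample(sorted(species), per_order)
    return sample

# Synthetic toy input: 200 made-up orders with 30 made-up species each.
toy_records = [(f"sp_{o}_{s}", f"order_{o}") for o in range(200) for s in range(30)]
toy_sample = sample_per_order(toy_records)
total_species = sum(len(v) for v in toy_sample.values())
```

With the synthetic input above, every order contributes exactly 20 species, which reproduces the 200 x 20 = 4,000 arithmetic on the slide.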

[Figure: Top Match Similarity of test queries by year, 2005 to 2015; vertical axis 0 to 100; series: >98%, >95%, >92%, >90%]
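The four similarity cut-offs in this chart (>98%, >95%, >92%, >90%) suggest a simple summary: for each cut-off, the share of queries whose top match exceeds it. The sketch below assumes the vertical axis is a percentage of queries, which the transcript does not state; the function name and the toy similarity values are invented for illustration.

```python
THRESHOLDS = (98.0, 95.0, 92.0, 90.0)

def share_above(top_match_similarities, thresholds=THRESHOLDS):
    """Percentage of queries whose top-match similarity exceeds each threshold."""
    n = len(top_match_similarities)
    if n == 0:
        raise ValueError("need at least one query result")
    return {
        t: 100.0 * sum(s > t for s in top_match_similarities) / n
        for t in thresholds
    }

# Made-up top-match similarities (percent identity) for eight queries.
toy_similarities = [99.5, 98.7, 96.0, 93.1, 91.0, 85.0, 99.9, 97.2]
toy_shares = share_above(toy_similarities)
# toy_shares == {98.0: 37.5, 95.0: 62.5, 92.0: 75.0, 90.0: 87.5}
```

Repeating this per year against the library as it stood in that year would give one curve per threshold, which is the shape of the chart described on the slide.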


[Figure: two PCA panels (PC1 versus PC2), titled "K-mers (k=3)" and "Amino Acid Composition"; points grouped by order: Coleoptera, Diptera, Ephemeroptera, Hemiptera, Hymenoptera, Lepidoptera, Plecoptera, Thysanoptera, Trichoptera]

Ratnasingham, Ma, Hebert, in prep.
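For readers unfamiliar with the "K-mers (k=3)" representation, the sketch below shows how a sequence can be turned into a fixed-length vector of 3-mer frequencies, the kind of feature vector a PCA could then be run on. It is an illustration under assumptions: the function name is invented, windows containing anything other than A, C, G, T are ignored, and the PCA step itself is not shown. The unpublished method cited on the slide may differ.

```python
from collections import Counter
from itertools import product

def kmer_frequencies(seq, k=3):
    """Relative frequency of each of the 4**k DNA k-mers, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only windows made purely of A, C, G, T contribute to the total.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        raise ValueError("sequence contains no unambiguous k-mers")
    return [counts[m] / total for m in kmers]

toy_vector = kmer_frequencies("ACGTACGT")
# 64 entries; the windows are ACG, CGT, GTA, TAC, ACG, CGT, so ACG has frequency 2/6.
```

Because every sequence maps to the same 64 dimensions regardless of length, vectors from different specimens can be stacked into a matrix without aligning the sequences first.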

Barcode Index Number (BIN)

Algorithm

• Tuned to the marker (COX1)

• Fixed parameters for balanced OTU generation

• Uses prior threshold but refines for each group

Registry

• Occurrence of DNA Barcode (place and time)

• Aggregation of all associated metadata

• Reusable - works across studies
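To make the idea of a "prior threshold" concrete, the sketch below groups aligned sequences into OTUs by plain single-linkage clustering at a fixed distance cut-off. This is a deliberately simplified stand-in, not the BIN algorithm: the per-group refinement mentioned in the bullets is not implemented, and the function names, the p-distance measure, and the 0.1 threshold used on the toy data are assumptions chosen for illustration.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def single_linkage_otus(seqs, threshold):
    """Merge sequences into OTUs whenever any pair is within `threshold`."""
    parent = {name: name for name in seqs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())

# Made-up 20-base sequences: s1 and s2 differ at 1 site, s3 differs from both at 3 or more.
toy_seqs = {
    "s1": "ACGTACGTACGTACGTACGT",
    "s2": "ACGTACGTACGTACGTACGA",
    "s3": "TTTTACGTACGTACGTACGA",
}
toy_otus = sorted(sorted(c) for c in single_linkage_otus(toy_seqs, threshold=0.1))
# toy_otus == [["s1", "s2"], ["s3"]]
```

A fixed cut-off like this is what the "refines for each group" bullet improves on; the registry bullets then describe giving each resulting cluster a persistent identifier so it can be reused across studies.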

BOLD:AAF2716

Notioplusia illustrata

Oraesia Janzen 03

Ctenoplusia sp. ANIC1

Importance of registering OTUs

[Figure: counts, axis 0 to 300,000, for Mammalia, Birds, Insecta, Fish, Araneae, Mollusca, Plants, Fungi; series: Species, Barcode Index Numbers, Unregistered OTUs]

[Figure: axis 0 to 60,000; label: Latitudinal Range]

Allow anyone, anywhere, to identify any organism

LifeScanner Solution Overview

[Diagram labels: Sample Collection, Mail, Prep, PCR, Sequencing, Partner Sequencing Labs, ID Engine, Species Identification]

Moving into decade 2

Embrace Big Data

[Figure: Impact versus Complexity (Data volume & Dimensionality)]

Reporting: What happened?

Analysis: Why did it happen?

Monitoring: What is happening?

Forecasting: What might happen?

Support for multiple scales

10², 10³, 10⁴, 10⁴

Support for NGS based Barcoding

BOLD4

Launch of Beta on Sept 29, 2015

Advancing the Automation of Species Identification

Community Curated Libraries

Tier 1

• Purpose generated & reference specimens available

• Barcode compliant & consistent

• Key species (e.g. Dirty 22, Domesticated & Bush meat)

Tier 2

• Curated for consistency in taxonomic assignments

• Barcode compliant & consistent

• CITES/REDLIST (e.g. Endangered & controlled species)

Tier 3

• Mined from BOLD

• Limited verification and only to be used as last resort

• Disease vectors & invasive species

[Figure values: 78%, 20%, 25%]
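The note that the lowest tier is "only to be used as last resort" implies a lookup order: try the most trusted library first and fall back only when it has no good match. A minimal sketch of that fallback follows. Everything in it is hypothetical: the `tiers` dictionary layout, the species names, the identity-based similarity on aligned sequences, and the 0.95 cut-off are assumptions for illustration and do not describe the actual ID engine.

```python
def similarity(a, b):
    """Fraction of identical sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def identify(query, tiers, min_similarity=0.95):
    """Return (tier, species) from the highest-priority tier with a good match.

    `tiers` maps a tier number to {species name: reference sequence}.
    Lower tier numbers are consulted first; None means no tier matched.
    """
    for tier in sorted(tiers):
        best = max(
            tiers[tier].items(),
            key=lambda item: similarity(query, item[1]),
            default=None,
        )
        if best is not None and similarity(query, best[1]) >= min_similarity:
            return tier, best[0]
    return None

# Made-up reference libraries with one 20-base sequence each.
toy_tiers = {
    1: {"Species A": "ACGTACGTACGTACGTACGT"},
    2: {"Species B": "ACGTACGTACTTACGTACGT"},
    3: {"Species C": "TTGTACGTACTTACGTACGA"},
}
hit_tier1 = identify("ACGTACGTACGTACGTACGT", toy_tiers)
hit_tier3 = identify("TTGTACGTACTTACGTACGA", toy_tiers)
no_hit = identify("GGGGGGGGGGGGGGGGGGGG", toy_tiers)
```

Returning the tier alongside the species name lets a caller report how much verification stands behind an identification, which matters most when the answer comes from the last-resort tier.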

Community Defined Metadata Extensions

Rougerie R, Smith MA, Fernandez-Triana J, Lopez-Vaamonde C, Ratnasingham S, Hebert PDN. 2011. Molecular analysis of parasitoid linkages (MAPL): gut contents of adult parasitoid wasps reveal larval host. Mol Ecol 20:179-186.

More Analytical Tools


Community Developed Analytical Tools

Plug-in Framework

6000+ Analytical Packages in CRAN

Analytical Pipelines

Simple Pipeline

BOLD 4 + SAP Lumira

BOLD4 – Some other features

• Checklist Support (synonyms, progress, shopping lists)

• Data portal for core facilities

• Complete record histories

• RESL algorithm on your own datasets

• Storage and analysis of pre-clustered NGS data

Support for Metabarcoding

mBRAVE: Metabarcoding Research And Visualization Environment

Linkages & Partners

Acknowledgements

BOLD Team

Paul Hebert, Dan Janzen & Winnie Hallwachs, Scott Miller, Axel Hausmann, Stefan Kremer, and many others!

Biodiversity Institute Collections Team

Canadian Centre for DNA Barcoding Team

BOLD Users