wormbase - home | national academiessites.nationalacademies.org/.../webpage/ssb_160890.pdf ·...

WormBase Todd Harris, PhD @tharris [email protected] CBPSS Mini Symposium

Post on 03-Aug-2020

Page 1: WormBase - Home | National Academiessites.nationalacademies.org/.../webpage/ssb_160890.pdf · 2020-04-08 · Comparative Genomics Strains / Antibodies / Oligos Expression Lineage

WormBase Todd Harris, PhD

@tharris [email protected] CBPSS Mini Symposium

Page 2:

Mission

Provide the biomedical research community with accurate, current, and accessible information on the genetics, genomics, and biology of the model system Caenorhabditis elegans and related nematodes.

Page 3:

C. elegans in 30 seconds

Relatively simple organism, advanced genetic system.

Hermaphrodite

Male

1 mm

Page 4:

C. elegans in 30 seconds

Invariant lineage

Page 5:

C. elegans in 30 seconds

Simple nervous system

302 neurons

Described connectivity

Page 6:

C. elegans in 30 seconds

A frozen C. elegans library

Rapid generation time

Page 7:

C. elegans in 30 seconds

100 Mbp genome

1998 (!)

~20K genes

Page 8:

A tradition of Open Science

[Timeline figure. Years shown: 1963, 1974, 1989, 1994, 1995, 2000, 2003. Milestones labeled: Brenner’s Letters; 1st genetic screen published; Gazette; AceDB development begins; BioNet; gopher; www.]

Page 9:

The WormBase Consortium

Page 10:

User Community

1106 laboratories

53 countries

3000 researchers

Registered C. elegans laboratories

Country            Labs
United States       594
Canada               62
United Kingdom       60
Japan                58
Germany              48
France               31
China                28
Spain                20
Switzerland          20
The Netherlands      16

Page 11:

User Community

185 countries

Biomedical researchers studying aging, neurobiology, cancer, etc.

37K unique users/month

5.5M page views / month

Page 12:

wormbase.org

Page 13:

Contents & Features

28 Species

Genomes

Genes

Orthology / Homology / Paralogy

Comparative Genomics

Strains / Antibodies / Oligos

Expression

Lineage & Connectivity

Authors & Publications

Labs

Reports

Genome Browsers

Alignment Tools

Query Tools

APIs

Data Mining Platforms

Social Features

FTP

Forums, Wikis, Blogs

Page 14:

Workflow

1. Curation

2. Integration & analysis

3. Presentation

Page 15:

Curation Goals

1. Extract data from the scientific literature.

2. Develop standards to structure data.

3. Facilitate new insights by making prose observations computable.

Page 16:

Curated Sources

Scientific literature (~30K papers)

User submissions

Genomic sequences (gene models)

3rd party datasets

Page 17:

Early Realizations

Curation is hard and time-consuming!

Requires automation.

Need tools to facilitate.

Balance of breadth and depth is critical for making a useful community resource.

Many data types.

Prioritization is key.

Work procedurally through data types.

Page 18:

Hybrid automated/manual curation strategy

Van Auken et al., Database, 2012

Page 19:

Curated data types

Phenotypes

Expression Patterns

Sequence Features

Gene Interactions

Anatomy Function

Pathways

Reagents

Human Disease Relevance

Page 20:

Reference datasets

Large-scale data at WormBase

• Proteomics (mass spec)

• Transcriptomics (splicing, UTRs)

• Expression (microarray, in vivo imaging)

• Interactions (physical, genetic)

• Perturbation: RNAi, systematic mutation

• Lineage and connectivity

Page 21:

Reference datasets

Broad reference data sets can fill knowledge gaps.

• Verification can be difficult

• Relevance?

• Utilization varies greatly.

Confidence?

Page 22:

Do we assess the quality of… experimental design? external data?

Publication is the gold standard.

Revisit: erroneous data

Request corrections or clarifications when warranted

Page 23:

Remaining backlog

Page 24:

Curation: Lessons Learned

• harder and consumes more time than expected

• more enriching to the final product than expected

• curation ensures data integrity and builds trust in the resource

Page 25:

Curation: Suggestions

• Start early to develop best practices.

• Automate as much as possible.

• Employ domain experts for high-value manual curation and to confirm precision of automated curation.

• Expect publication rate and new data types to exceed manual curation capacity (10% Y-o-Y).

• Refining curation will be an ongoing enterprise.
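
As a rough illustration of the capacity point, the sketch below compounds annual paper intake against a fixed manual curation capacity. It reads the slide's 10% figure as year-over-year growth in incoming papers, which is an interpretation, and the starting volume and capacity (1,000 papers per year each) are invented for the example.

```python
def backlog_by_year(initial_papers: float, growth: float,
                    capacity: float, years: int) -> list[float]:
    """Return the uncurated backlog at the end of each year.

    initial_papers: papers arriving in year 1 (hypothetical figure)
    growth:         fractional year-over-year growth in intake, e.g. 0.10
    capacity:       papers the team can curate manually per year (hypothetical)
    """
    backlog = 0.0
    papers = initial_papers
    history = []
    for _ in range(years):
        # Whatever exceeds capacity this year is added to the backlog;
        # spare capacity works the backlog down, but never below zero.
        backlog = max(backlog + papers - capacity, 0.0)
        history.append(backlog)
        papers *= 1 + growth
    return history


if __name__ == "__main__":
    for year, size in enumerate(backlog_by_year(1000, 0.10, 1000, 5), start=1):
        print(f"year {year}: backlog of about {size:.0f} papers")
```

Under these made-up inputs a team that exactly keeps pace in year 1 is roughly 1,100 papers behind by year 5, which is the argument for automating early.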

Page 26:

What fundamentals have driven our workflow design?

Page 27:

What fundamentals have driven our design?

1. Ease of data modeling and loading

Emphasis on collecting and sharing data.

Page 28:

What fundamentals have driven our design?

2. Handling unknown unknowns

Yet-to-be-discovered …

- datatypes

- data relationships

Data model must be able to evolve.

Page 29:

What fundamentals have driven our design?

3. Ability to track supporting evidence, metadata, and provenance

Reproducibility and accountability.
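
One way to picture evidence and provenance tracking is to attach evidence records to every annotation rather than storing bare facts. The sketch below is a minimal illustration, not WormBase's actual data model: the class names, fields, and identifiers such as "gene:example-1" are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Evidence:
    """Who asserted something, on what basis, and when (hypothetical fields)."""
    source: str        # e.g. a paper or submission identifier
    method: str        # "manual" or "automated"
    curator: str       # person or pipeline responsible
    recorded_on: date


@dataclass
class Annotation:
    """A single computable statement plus the evidence that backs it."""
    subject: str
    predicate: str
    value: str
    evidence: list[Evidence] = field(default_factory=list)

    def add_evidence(self, ev: Evidence) -> None:
        self.evidence.append(ev)

    def is_supported(self) -> bool:
        # An annotation with no evidence cannot be traced or reproduced.
        return bool(self.evidence)

    def sources(self) -> list[str]:
        # Distinct sources, sorted for stable display.
        return sorted({ev.source for ev in self.evidence})


if __name__ == "__main__":
    ann = Annotation("gene:example-1", "has_phenotype", "slow growth")
    ann.add_evidence(Evidence("paper:example-42", "manual", "curator-a", date(2014, 1, 1)))
    print(ann.is_supported(), ann.sources())
```

Keeping evidence as a list lets the same statement accumulate independent support over time, and makes unsupported annotations easy to find and review.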

Page 30:

What fundamentals have driven our design?

4. Coping with high-connectivity data

e.g., what happens to downstream annotations (orthology, proteomics, expression, etc.) if genes merge?
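
The gene-merge question can be made concrete with a small sketch. It assumes one possible policy: a retired identifier is never deleted, it forwards to the surviving gene, and annotations follow the merge. The class, method names, and identifiers are hypothetical and are not taken from WormBase.

```python
class GeneRegistry:
    """Toy registry in which retired gene IDs forward to their successors."""

    def __init__(self) -> None:
        self._merged_into: dict[str, str] = {}       # retired ID -> surviving ID
        self._annotations: dict[str, list[str]] = {}  # live ID -> annotations

    def resolve(self, gene_id: str) -> str:
        # Follow the forwarding chain until a live identifier is reached.
        while gene_id in self._merged_into:
            gene_id = self._merged_into[gene_id]
        return gene_id

    def annotate(self, gene_id: str, note: str) -> None:
        self._annotations.setdefault(self.resolve(gene_id), []).append(note)

    def merge(self, retired_id: str, surviving_id: str) -> None:
        retired = self.resolve(retired_id)
        surviving = self.resolve(surviving_id)
        if retired == surviving:
            return  # already the same gene; also prevents forwarding cycles
        self._merged_into[retired] = surviving
        # Downstream annotations move to the surviving gene.
        moved = self._annotations.pop(retired, [])
        self._annotations.setdefault(surviving, []).extend(moved)

    def annotations(self, gene_id: str) -> list[str]:
        return list(self._annotations.get(self.resolve(gene_id), []))


if __name__ == "__main__":
    reg = GeneRegistry()
    reg.annotate("gene-A", "ortholog: X")
    reg.annotate("gene-B", "expressed in neurons")
    reg.merge("gene-B", "gene-A")
    print(reg.resolve("gene-B"), reg.annotations("gene-B"))
```

Because old identifiers keep resolving, external links and previously published gene IDs continue to work after the merge. Whether every annotation type should simply be carried over is a curation decision that this sketch does not address.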

Page 31:

What fundamentals have driven our design?

5. Finding a suitable refresh rate

How often will you update analyses?

Datasets evolve. New data becomes available. Analyses need to be updated.

How tolerant will your community be of stale data?

Page 32:

What fundamentals have driven our design?

5. Finding a suitable refresh rate

Year    Refresh interval
2001    1 week
2002    2 weeks
2005    3 weeks
2008    1 month
2011    2 months

Balance of stability, rate of new data, cost/time of analysis, churn.

Page 33:

Design: Lessons Learned

1. A flexible model/workflow is essential.

2. Evidence and metadata collection needs to be central to the process.

3. High-connectivity data presents unique challenges.

4. Needed to adjust release frequency.

Page 34:

Design: Suggestions

1. Build flexibility into both the data model and workflow.

2. Be aware of consequences of changing high-connectivity data.

3. Refresh frequency is a balance of user needs, resources, and rate of change.

Page 35:

Integration & Interoperability

Page 36:

Suggestions for integrating with organismal databases (easy)

• Liaise with organismal databases early and often!

• Use stable identifiers! Most organism databases have them. Please?

Page 37:

Suggestions for integrating with organismal databases (harder)

Reciprocal data exchange and cross links

Crosslinks alone are boring and do not engage users.

Without some supporting context, crosslinks do not increase interoperability.

Page 38:

Suggestions for integrating with organismal databases (hardest)

Avoid direct data import

Except for core scaffolding features (e.g., genomes, genes), use APIs to fetch and embed functional data.

Page 39:

Interoperability Suggestions

1. Provide data in (multiple) common formats

2. API (RESTful) with JSON and XML delivery

3. Data files programmatically accessible: simple is better (FTP), no registration barrier or fancy web-based download scheme.

4. Consistent, shared identifiers
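
To show what "RESTful with JSON delivery" plus shared identifiers buys a client, here is a minimal sketch of the consuming side. The base URL, path layout, `format` parameter, and response fields are placeholders invented for the example; they do not describe WormBase's real API.

```python
import json
from urllib.parse import quote, urlencode

# Hypothetical endpoint; substitute the real service's documented base URL.
BASE_URL = "https://db.example.org/api"


def build_url(resource: str, identifier: str, fmt: str = "json") -> str:
    """Build a predictable resource URL from a type and a stable identifier."""
    path = f"{BASE_URL}/{quote(resource, safe='')}/{quote(identifier, safe='')}"
    return f"{path}?{urlencode({'format': fmt})}"


def parse_gene(payload: str) -> dict:
    """Pull the fields a client needs out of a JSON response body.

    Unknown fields are ignored, so the server can add data without
    breaking existing clients.
    """
    record = json.loads(payload)
    return {"id": record["id"], "name": record.get("name", "")}


if __name__ == "__main__":
    print(build_url("gene", "gene-A"))
    # A canned response stands in for a network call in this sketch.
    print(parse_gene('{"id": "gene-A", "name": "abc-1", "species": "example"}'))
```

A real client would fetch the URL with an HTTP library and handle errors and rate limits. The point here is only that stable identifiers and a plain format keep both sides simple.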

Page 40:

If you build it, will they come?

Page 41:

[Figure: Pageviews vs. time, 2001 to 2013; y-axis from 0 to 80,000,000]

Page 42:

Nurture Your Community

Collect feedback: chat, Twitter, Google Alerts, mailing lists, conferences, webinars, surveys.

Measure: web logs, CloudWatch, Google Analytics.

Set standards: data quality, curation, submission, help desk response times.

Page 43:

Metrics of success

Small user communities, niche domains.

Providing annotation or feedback is a low priority for busy scientists.

Positive feedback rare, but you’ll know when users don’t like something!

Not easy to measure.

Page 44:

Suggested Metrics

• Page Views

• Citation Rate

• Downloads

• Queries & Resolutions

• Rate / precision of curation

• Database size / objects / submissions
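
Of these metrics, curation precision is straightforward to compute once a domain expert has reviewed a sample of automated calls. The sketch below assumes each reviewed call is recorded as True (expert agrees) or False (expert rejects); the function name and the sample data are illustrative only.

```python
def curation_precision(verdicts: list[bool]) -> float:
    """Fraction of automated calls that an expert reviewer confirmed.

    verdicts: one entry per reviewed automated call,
              True if the expert confirmed it, False if rejected.
    """
    if not verdicts:
        raise ValueError("no reviewed calls; precision is undefined")
    # True counts as 1 and False as 0, so the sum is the number confirmed.
    return sum(verdicts) / len(verdicts)


if __name__ == "__main__":
    # Hypothetical review of four automated calls, three confirmed.
    print(curation_precision([True, True, True, False]))
```

Precision says nothing about the items the automated step missed, so a fuller picture would pair it with a recall estimate drawn from manually curated papers.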

Page 45:

Performance Metrics

Page 46:

Acknowledgments

Paul Sternberg

Juancarlos Chan

Wen Chen

Chris Grove

Raymond Lee

Ranjana Kishore

Cecilia Nakamura

Daniela Raciti

Gary Schindelman

Mary Ann Tuli

Kimberly Van Auken

Xiaodong Wang

Karen Yook

Hans-Michael Muller

Yuling Li

James Done

Lincoln Stein

Sibyl Gao

Todd Harris

Matt Berriman

Paul Kersey

Paul Davis

Thomas Done

Kevin Howe

Michael Paulini

Gary Williams

@tharris

@wormbase