ags 2014 i5k_workspace

18
The i5k Workspace@NAL – enabling genomic data access, visualization and curation for the i5k community Monica Poelchau and the i5k Workspace@NAL group

Upload: monica-poelchau

Post on 14-Jun-2015

172 views

Category:

Science


1 download

DESCRIPTION

Talk given at the 2014 Arthropod Genomics Symposium on the i5k Workspace@NAL.

TRANSCRIPT

Page 1: AGS 2014 i5k_workspace

The i5k Workspace@NAL – enabling genomic data access, visualization and curation for the i5k community

Monica Poelchau and the i5k Workspace@NAL group

Page 2: AGS 2014 i5k_workspace

The i5k Workspace@NAL

• Our goal: to help the i5k pilot project, and any other i5k genomes, during the post-annotation process

• Reminder: We are all i5k! Any arthropod genome project is welcome.

• NAL = National Agricultural Library• The i5k Workspace@NAL: a new collaboration

between the i5k coordinating committee and the NAL to develop, improve and implement tools that allow better access to genomic information for ‘orphaned’ i5k genomes

Monica Poelchau
Monica Poelchau
for the agricultural communityMy co-worker, Chris CHilders, and I were hired specifically to fulfull this mission.
Page 3: AGS 2014 i5k_workspace

CONTACT US: [email protected]

• I want the i5k Workspace@NAL to host my genome project.

• I want to use the i5k Workspace@NAL to access i5k genomic data.

• The i5k Workspace@NAL really needs to improve in X area for me to be willing to use it. I’ll let Monica know after the talk.

Page 4: AGS 2014 i5k_workspace

Photo credit Z. Szendrei

DNARNA

SequencingGenome assembly

Genome annotation AssemblyAnnotations

The life cycle of a ‘typical’ i5k genome project

Raw datafiles

How do you:• Share your data with collaborators; • Manually curate and improve your

annotations;• Visualize your assembly and

annotations;• Perform downstream analyses?

NCBI BioProject

i5k Workspace@NAL

iPlant(http://www.iplantc.org/)

Monica Poelchau
NCBI BioProjects won't take annotations, though - they'll take raw reads and the assembly (after an approval process)
Monica Poelchau
datasets are large and hard to transfer; need to communicate experimental details effectively
Monica Poelchau
Automated gene predictions, while good, are never perfect, and therefore benefit from manual editing. There are also many, many annotations.
Monica Poelchau
Datasets are large; therefore patterns can be difficult to infer without visualization
Monica Poelchau
Again - if you're not command-line savvy, perfoming even basic operations on a large dataset can be challenging.
Monica Poelchau
The biggest problems with genome projects boil down to: scale. Raw data is huge, and processed data is still large. How do you...
Page 5: AGS 2014 i5k_workspace

i5k Workspace

@ NAL

‘Frozen’ genome assembly

Automated annotations

Ancillary datafiles (e.g.

RNA-Seq alignments)

Organism page

Blast

Genome browserWeb Apollo

Bulk downloads

Integrated gene set

v1.0

Tutorials

Individual help and consultations

Page 6: AGS 2014 i5k_workspace

I5k Workspace@NAL in numbers• >5,000 pageviews

since April 10th, 2014

• 23 species available – all with ‘pre-release’ annotations

• 188+ registered annotators

• >1,700 gene models annotated

Page 7: AGS 2014 i5k_workspace

Taxonomic scopeArachnida: 1Crustacea: 1Palaeoptera: 2Paraneoptera/Hemiptera: 6Chrysomeloidea: 2Hymenoptera: 3Acalyptratae: 9

Monica Poelchau
Beetles
Monica Poelchau
flies?
Page 8: AGS 2014 i5k_workspace

Example workflow

Page 9: AGS 2014 i5k_workspace

Example workflow

Page 10: AGS 2014 i5k_workspace

Example workflow

Page 11: AGS 2014 i5k_workspace

Example workflow

Page 12: AGS 2014 i5k_workspace

Example workflow

Page 13: AGS 2014 i5k_workspace

Some notes on manual gene model curation

• Importance of gathering a community (see Monica Muñoz-Torres’ poster #48 on Saturday)

• Collaborating with other taxonomic or interest groups to perform annotations of the same gene family across (more or less) closely related species

• Helps to have RNA-Seq data, transcriptome alignments• Note that the quality of the automated annotations

depend on the quality of the genome assembly – lower N50s will generally mean more work for the manual annotators

Page 14: AGS 2014 i5k_workspace

Photo credit Z. Szendrei

DNARNA

SequencingGenome assembly

Genome annotationAssembly

AnnotationsGene models

The life cycle of a ‘typical’ i5k genome project

Raw datafiles

How do you:• Share your data with collaborators; • Manually curate and improve your

annotations;• Visualize your assembly and

annotations;• Perform downstream analyses. iPlant

(http://www.iplantc.org/)

Monica Poelchau
datasets are large and hard to transfer; need to communicate experimental details effectively
Monica Poelchau
Automated gene predictions, while good, are never perfect, and therefore benefit from manual editing. There are also many, many annotations.
Monica Poelchau
Datasets are large; therefore patterns can be difficult to infer without visualization
Monica Poelchau
Again - if you're not command-line savvy, perfoming even basic operations on a large dataset can be challenging.
Monica Poelchau
The biggest problems with genome projects boil down to: scale. Raw data is huge, and processed data is still large. How do you...
Monica Poelchau
NCBI BioProjects won't take annotations, though - they'll take raw reads and the assembly (after an approval process)
Page 15: AGS 2014 i5k_workspace

Performing downstream analyses

• iPlant and CoGe have various comparative genomics tools

• We are working with iPlant to integrate the i5k Workspace@NAL with iPlant’s datastore, which would allow immediate access to all of iPlant’s and CoGe’s analysis tools

Page 16: AGS 2014 i5k_workspace

Future plans and developments

• Move Workspace to iPlant to allow direct analysis in the iPlant cloud

• Metadata support (e.g. i5k-specific MIGS standards)

• Improved data submission tools• Quality metric implementation and visualization• Ongoing tool improvement and development

(including user suggestions)

Page 17: AGS 2014 i5k_workspace

Acknowledgements• The NAL team:

– Chris Childers– Vijaya Tsavatapalli– Gary Moore– Susan McCarthy– Chien-Yueh Lee, Han Lin, Jun-Wei

Lin

• I5k Workspace@NAL advisory committee: – Jay Evans– Don Gourley– Kevin Hackett– Simon Liu– Ursula Pieper

• The i5k coordinating committee

• The i5k pilot project• Wayne Hunter• All of our users and

data contributors!

Page 18: AGS 2014 i5k_workspace

CONTACT US: [email protected]

• We will have a site demo during TODAY’S poster session.

• I want the i5k Workspace@NAL to host my genome project.

• I want to use the i5k Workspace@NAL to access i5k genomic data.

• The i5k Workspace@NAL really needs to improve in X area for me to be willing to use it. I’ll let Monica know after the talk.