Ben Moore

Ensembl Outreach

EMBL-EBI

The Ensembl online training series 2016

This webinar course

Date | Webinar topic | Instructor
24th March | Introduction to Ensembl | Emily Perry
31st March | Ensembl genes | Denise Carvalho-Silva
7th April | Data export with BioMart | Helen Sparrow
14th April | Variation data in Ensembl and the Ensembl VEP | Denise Carvalho-Silva
21st April | Comparing genes and genomes with Ensembl Compara | Helen Sparrow
28th April | Finding features that regulate genes – the Ensembl Regulatory Build | Emily Perry
5th May | Uploading your data to Ensembl and advanced ways to access Ensembl data | Ben Moore

Objectives

• What is Ensembl?

• What type of data can you get in Ensembl?

• How to navigate the Ensembl browser website.

• Where to go for help and documentation.

Structure

Presentation: What the data/tool is. How we produce/process the data.

Demo: Getting the data. Using the tool.

Exercises: On the Train online course.

Questions?

• Ask questions in the Chat box in the webinar interface

• My Ensembl colleagues will respond during the webinar

• There’s no threading so please respond with @username

Helen Sparrow, Emily Perry, Denise Carvalho-Silva

Poll Questions

- Poll 1: Did you attend the previous webinars?

- Poll 2: Have you done the previous exercises?

Course exercises

http://www.ebi.ac.uk/training/online/course/ensembl-browser-webinar-series-2016

This text will be replaced by a YouTube video of the webinar (with a link to YouKu too) and a PDF of the slides.

The “next page” will be the exercises. A link to the exercises and their solutions will appear in the page hierarchy.

Get help with the exercises

• Use the exercise solutions in the online course

• Join our Facebook group and discuss the exercises with everybody (see the online course for the link)

• Email us [email protected]

EBI is an Outstation of the European Molecular Biology Laboratory.

Custom Data and Advanced Data Access

Viewing your own data in Ensembl

Add custom tracks with your own data:

- BAM files
- GTF/GFF
- BED/BEDGraph files
- BigWig
- PSL
- VCF
- Pairwise interactions

http://www.ensembl.org/info/website/upload/index.html#formats



http://www.ensembl.org/info/website/upload/large.html#bw-format

http://www.ensembl.org/info/website/upload/large.html#vcf-format

Hands on

We’re going to map large-scale deletions from patients with microcephaly and developmental delay by uploading BED files.

chr5 36821632 37091234 P1
chr5 36731476 36978306 P2
chr5 36908552 37108671 P3
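The three deletions above are BED records (chromosome, start, end, name). Before uploading, it can help to check that the file parses and to see where the deletions overlap. A minimal Python sketch, assuming whitespace-separated four-column BED lines; the helper names `parse_bed` and `common_overlap` are made up for this illustration and are not part of any Ensembl tool:

```python
# Sketch only: parse the three BED records from the slide and find the
# region shared by all deletions. Helper names are hypothetical.

BED_TEXT = """\
chr5 36821632 37091234 P1
chr5 36731476 36978306 P2
chr5 36908552 37108671 P3
"""


def parse_bed(text):
    """Return a list of (chrom, start, end, name) tuples from BED text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and optional track/browser/comment header lines
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, name = line.split()[:4]
        records.append((chrom, int(start), int(end), name))
    return records


def common_overlap(records):
    """Return (chrom, start, end) shared by every record, or None."""
    chroms = {rec[0] for rec in records}
    if len(chroms) != 1:
        return None  # no records, or records on different chromosomes
    start = max(rec[1] for rec in records)
    end = min(rec[2] for rec in records)
    if start >= end:
        return None  # the intervals do not all overlap
    return (chroms.pop(), start, end)


if __name__ == "__main__":
    deletions = parse_bed(BED_TEXT)
    print(common_overlap(deletions))  # ('chr5', 36908552, 36978306)
```

The interval shared by all three patients is one reasonable place to start looking in the browser once the track is uploaded.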

Advanced data access

• Accessing data at different scales:

• Full database download from the FTP site

• Direct database access with MySQL

• Programmatic access with the Perl API

• Fast and flexible access with the REST API

Access scales

[Figure: access methods arranged by scale, from one by one, through groups, to whole genome. Methods shown: Main browser, Mobile site, BioMart, REST API, VEP, Perl API, MySQL, FTP.]

FTP

• Files of our complete database:

• Genomic, cDNA, CDS, ncRNA and protein sequence (FASTA)

• Annotated sequence (EMBL, GenBank)

• Gene sets (GTF, GFF)

• Whole-genome multiple and gene-based multiple alignments (MAF)

• Variants (VCF, GVF)

• Constrained elements (BED)

• Regulatory features (BED, BigWig)

• RNA-Seq files (BAM, BigWig)

• MySQL database

Access FTP

Your favourite FTP client

FTP downloads page: http://www.ensembl.org/info/data/ftp/index.html

FTP site: ftp://ftp.ensembl.org/pub/

FTP files are big

• Multiple Mb/Gb

• Lots of time to download/unzip

• Do you really need this data?

• Make sure it’s the right file before you download.
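One way to act on this advice is to ask the FTP server for a file's size before committing to the download. A minimal sketch using Python's standard `ftplib`, assuming anonymous login to ftp.ensembl.org is available; the file path is supplied by you on the command line (none is assumed here), and `format_size`, `remote_size` and `main` are illustrative names:

```python
# Sketch only: report the size of a file on the Ensembl FTP site before
# downloading it. The path to check is passed on the command line.
import sys
from ftplib import FTP

FTP_HOST = "ftp.ensembl.org"  # host taken from the FTP site URL above


def format_size(num_bytes):
    """Convert a byte count to a short human-readable string."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return "%.1f %s" % (size, unit)
        size /= 1024
    return "%.1f TB" % size


def remote_size(path, host=FTP_HOST):
    """Return the size in bytes of `path` on the FTP server (or None)."""
    ftp = FTP(host)
    try:
        ftp.login()            # anonymous login
        ftp.voidcmd("TYPE I")  # binary mode, so SIZE reports true bytes
        return ftp.size(path)
    finally:
        ftp.close()


def main(argv):
    if len(argv) < 2:
        print("Usage: check_size.py <path on FTP site>")
        return
    size = remote_size(argv[1])
    print("unknown size" if size is None else format_size(size))


if __name__ == "__main__":
    main(sys.argv)
```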

FTP site summary

Skills needed: Web-browsing or FTP client use. Handling and parsing file types.
Scalability: Whole database only.
Speed: Many minutes to download and decompress a file.
Difficulty to query: Files easy to download and decompress.
Long-term: New files with each release. File types stay the same.
Sequences? Yes

Ensembl data through MySQL

• Direct database querying using MySQL queries

• http://www.ensembl.org/info/data/mysql.html

mysql -u anonymous -h ensembldb.ensembl.org

mysql> use homo_sapiens_core_82_38;

mysql> SELECT gene.stable_id FROM gene, xref WHERE gene.display_xref_id = xref.xref_id AND xref.display_label LIKE 'brca2';

MySQL schema

Ensembl data through MySQL

• I have an Ensembl gene ID - ENSG00000079950

• I want to get the EntrezGene ID for this gene

• We need to refer to the schema:

http://www.ensembl.org/info/docs/api/core/core_schema.html

The schema

Get the xref ID: select xref.display_label

Choose our tables: from xref, object_xref, gene, external_db

Link the xref to the object xref: xref.xref_id = object_xref.xref_id

Link the object xref to the gene ID: object_xref.ensembl_id = gene.gene_id

Specify you want gene xrefs: object_xref.ensembl_object_type = 'Gene'

Get the gene: gene.stable_id = "ENSG00000079950"

Link the xref to the external database: external_db.external_db_id = xref.external_db_id

Choose the external database: external_db.db_display_name = "EntrezGene"

Our query

/usr/local/mysql/bin/mysql -h ensembldb.ensembl.org -u anonymous -P 3306

use homo_sapiens_core_82_38;

select gene.stable_id, xref.display_label from xref, object_xref, gene, external_db where xref.xref_id=object_xref.xref_id and object_xref.ensembl_id=gene.gene_id and object_xref.ensembl_object_type = "Gene" and gene.stable_id = "ENSG00000079950" and external_db.external_db_id=xref.external_db_id and external_db.db_display_name= "EntrezGene";

Use port 3337 for GRCh37
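To see why each join condition matters, the four-table query can be tried on a tiny stand-in database. The sketch below uses Python's built-in `sqlite3` with only the columns the query touches; every row (the internal IDs and the `PLACEHOLDER_*` labels) is invented for illustration and is not real Ensembl data. It filters on `object_xref.ensembl_object_type = 'Gene'` (in the core schema this column lives in `object_xref`), which matters because a transcript can share the same internal numeric ID as a gene.

```python
# Sketch only: a toy in-memory copy of the four tables used in the query.
# All rows are invented placeholders, not real Ensembl records.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (gene_id INTEGER, stable_id TEXT);
CREATE TABLE external_db (external_db_id INTEGER, db_display_name TEXT);
CREATE TABLE xref (xref_id INTEGER, external_db_id INTEGER, display_label TEXT);
CREATE TABLE object_xref (ensembl_id INTEGER, ensembl_object_type TEXT, xref_id INTEGER);

INSERT INTO gene VALUES (1, 'ENSG00000079950');
INSERT INTO external_db VALUES (10, 'EntrezGene'), (20, 'OtherDB');
INSERT INTO xref VALUES
    (100, 10, 'PLACEHOLDER_ENTREZ_ID'),
    (101, 20, 'PLACEHOLDER_OTHER_ID'),
    (102, 10, 'PLACEHOLDER_TRANSCRIPT_XREF');
-- A transcript whose internal ID (1) happens to equal the gene's internal ID
INSERT INTO object_xref VALUES
    (1, 'Gene', 100),
    (1, 'Gene', 101),
    (1, 'Transcript', 102);
""")

QUERY = """
SELECT gene.stable_id, xref.display_label
FROM xref, object_xref, gene, external_db
WHERE xref.xref_id = object_xref.xref_id
  AND object_xref.ensembl_id = gene.gene_id
  AND object_xref.ensembl_object_type = 'Gene'
  AND gene.stable_id = ?
  AND external_db.external_db_id = xref.external_db_id
  AND external_db.db_display_name = ?
"""

rows = conn.execute(QUERY, ("ENSG00000079950", "EntrezGene")).fetchall()
print(rows)  # [('ENSG00000079950', 'PLACEHOLDER_ENTREZ_ID')]
```

Dropping the `ensembl_object_type` line from the toy query returns the transcript's xref as well, which is the kind of silent mismatch the filter guards against.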

MySQL summary

Skills needed: MySQL querying. Understanding of the schema.
Scalability: Can query whole genome.
Speed: Minimum query speed 9 ms. Time cost for complexity of query.
Difficulty to query: Queries can get very complicated if extracting data from multiple tables and are often not reusable.
Long-term: The schema can change between releases.
Sequences? No

Ensembl data through the Perl API

• Database querying using Perl scripts

• We use object-oriented Perl

my $gene_adaptor = $registry->get_adaptor( 'human', 'core', 'gene' );

my $gene = $gene_adaptor->fetch_by_display_label( 'brca2' );

print $gene->stable_id, "\n";

http://www.ensembl.org/info/data/api.html

Perl API

Learn Perl

Download API modules

Learn Ensembl API

(download more modules)

Write scripts

Get out all possible Ensembl data. Output in any format you like.

Learn to use the API

EBI Train Online course: http://www.ebi.ac.uk/training/online/course/ensembl-filmed-api-workshop

API documentation: http://www.ensembl.org/info/docs/Doxygen/core-api/index.html

Ensembl data through the Perl API

• I want a script that gets a gene name from the command line and prints its sequence.

• We’ve already learnt how to use the API and know our way around the documentation

• We need to write a script.

Perl script

#!/usr/bin/perl
use strict;
use warnings;

# Load the Ensembl API registry
use Bio::EnsEMBL::Registry;
my $reg = "Bio::EnsEMBL::Registry";
$reg->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

# Get the gene name from the command line (stop with a usage message if missing)
my $gene_name = shift or die "Usage: $0 <gene name>\n";

# Get the gene adaptor - this allows you to fetch genes from the database
my $gene_adaptor = $reg->get_adaptor('human', 'core', 'gene');

# Get genes using the gene adaptor
my @genes = @{ $gene_adaptor->fetch_all_by_external_name($gene_name) };

# Move through the genes one-by-one
foreach my $gene (@genes) {
    # Print the gene name, ID and sequence as a FASTA record
    print ">", $gene_name, " ", $gene->stable_id, "\n", $gene->seq, "\n";
}

Use port 3337 for GRCh37 (pass -port => 3337 to load_registry_from_db)

Perl API summary

Skills needed: Programming in Perl. Understanding of features in Ensembl.
Scalability: Can query whole database.
Speed: 1 s for start-up plus minimum query speed 50 ms. Time cost per datapoint.
Difficulty to query: Scripts needed, but these can be easily reused and adapted. API links data easily.
Long-term: The API takes database changes into account, so scripts do not need to change between releases.
Sequences? Yes

Data access via REST

• We’ve had a Perl API for a long time …

• … but not everybody works in Perl

• Our RESTful service allows language-agnostic access to our data.

• Visit rest.ensembl.org for installation, documentation and examples

What is REST?

• REST allows you to query the database using simple URLs giving output in plain text format

e.g. http://rest.ensembl.org/xrefs/symbol/homo_sapiens/BRCA2?content-type=application/json

gives [{"type":"gene","id":"ENSG00000139618"},{"type":"gene","id":"LRG_293"}]

• This means you can write scripts in any language to construct these URLs and read their output
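Building such a URL and reading its output takes only a few lines in most languages. A minimal Python 3 sketch using the standard library only; `symbol_url`, `gene_ids` and `fetch_gene_ids` are illustrative helper names, and the sample JSON is the output quoted on this slide rather than a live response:

```python
# Sketch only: construct an xrefs/symbol URL and pull the IDs out of the
# JSON it returns. Helper names are hypothetical.
import json
from urllib.parse import quote
from urllib.request import urlopen

SERVER = "http://rest.ensembl.org"


def symbol_url(species, symbol):
    """Build the xrefs/symbol URL for a species and gene symbol."""
    return "%s/xrefs/symbol/%s/%s?content-type=application/json" % (
        SERVER, quote(species), quote(symbol))


def gene_ids(json_text):
    """Return the 'id' of every entry in an xrefs/symbol JSON response."""
    return [entry["id"] for entry in json.loads(json_text)]


def fetch_gene_ids(species, symbol):
    """Query the live service (needs network access)."""
    with urlopen(symbol_url(species, symbol)) as response:
        return gene_ids(response.read().decode("utf-8"))


if __name__ == "__main__":
    sample = '[{"type":"gene","id":"ENSG00000139618"},{"type":"gene","id":"LRG_293"}]'
    print(symbol_url("homo_sapiens", "BRCA2"))
    print(gene_ids(sample))  # ['ENSG00000139618', 'LRG_293']
```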

Single endpoint demo

• I want to get a gene sequence from an Ensembl gene ID

• I need to use the docs to find an appropriate endpoint: http://rest.ensembl.org/

http://rest.ensembl.org/sequence/id/ENSG00000157764

Use grch37.rest.ensembl.org for GRCh37
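The single-endpoint request, including the switch to the GRCh37 server, can be wrapped in a small helper. A minimal Python 3 sketch; the function names are illustrative, the `GRCh38` label for the main server is an assumption based on the `_38` database name used earlier in these slides, and `content-type=text/plain` is chosen here so the response is just the sequence:

```python
# Sketch only: build a sequence/id URL for either assembly and fetch it.
from urllib.request import urlopen

SERVERS = {
    "GRCh38": "http://rest.ensembl.org",         # main server (assumed GRCh38)
    "GRCh37": "http://grch37.rest.ensembl.org",  # GRCh37 server from the slide
}


def sequence_url(stable_id, assembly="GRCh38", content_type="text/plain"):
    """Build the sequence/id URL for an Ensembl stable ID."""
    server = SERVERS[assembly]
    return "%s/sequence/id/%s?content-type=%s" % (server, stable_id, content_type)


def fetch_sequence(stable_id, assembly="GRCh38"):
    """Return the sequence as a string (needs network access)."""
    with urlopen(sequence_url(stable_id, assembly)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(sequence_url("ENSG00000157764"))
    print(sequence_url("ENSG00000157764", assembly="GRCh37"))
```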

Scripting demo

• I want a script that gets a gene name from the command line and prints its sequence.

• There’s no one endpoint that does this action, so I have to combine two endpoints with a script

Python script

#!/usr/bin/env python

# Get modules needed for script
import sys
import json
import httplib2

http = httplib2.Http(".cache")

# Get the gene name from the command line
gene_name = sys.argv[1]

# define the general URL parameters
server = "http://rest.ensembl.org"
headers = {"Content-Type": "application/json"}

# define REST query to get the gene ID from the gene name
ext_get_gene = "/xrefs/symbol/homo_sapiens/" + gene_name + "?"

# submit the query
resp, get_genes = http.request(server + ext_get_gene, method="GET", headers=headers)

# decode the json output
genes = json.loads(get_genes)

# move through the genes one-by-one
for gene in genes:
    # define the REST query to get the sequence from the gene
    ext_get_seq = "/sequence/id/" + gene["id"] + "?"

    # submit the query
    resp, get_seq = http.request(server + ext_get_seq, method="GET", headers=headers)

    # decode the json output
    seq = json.loads(get_seq)

    # print the gene name, ID and sequence as a FASTA record
    print(">" + gene_name + " " + gene["id"] + "\n" + seq["seq"])

POST demo

• Some endpoints can perform multiple queries at once using POST

• Use Postman https://www.getpostman.com/

POST demo

Choose POST. Input the endpoint.

POST demo

Choose Body. Input IDs in JSON.

{ "ids" : ["ENST00000000233", "ENST00000000412", "ENST00000000442", "ENST00000001008", "ENST00000001146", "ENST00000002125", "ENST00000002165", "ENST00000002501", "ENST00000002596", "ENST00000002829", "ENST00000003084", "ENST00000003100", "ENST00000003302", "ENST00000003583", "ENST00000003912", "ENST00000004103", "ENST00000004531", "ENST00000004982", "ENST00000005082", "ENST00000005178", "ENST00000005180", "ENST00000005226", "ENST00000005257", "ENST00000005259", "ENST00000005260", "ENST00000005284", "ENST00000005286", "ENST00000005340", "ENST00000005374", "ENST00000005386", "ENST00000005558", "ENST00000005756", "ENST00000005995", "ENST00000006015", "ENST00000006053", "ENST00000006251", "ENST00000006275", "ENST00000006658", "ENST00000006724", "ENST00000006750", "ENST00000006777", "ENST00000007264", "ENST00000007390", "ENST00000007414", "ENST00000007510", "ENST00000007516", "ENST00000007699", "ENST00000007708", "ENST00000007722", "ENST00000007735" ] }

POST demo

Choose Headers. Click on the pencil. Add Content-Type, then application/json, by typing the first few letters and selecting it.

POST demo
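The same kind of POST request can be sent from a script instead of Postman. A minimal Python 3 sketch with the standard library; the slides do not name the endpoint used in the demo, so `/lookup/id` is an assumption here (it is one of the endpoints that accept an `ids` list), and `build_post` and `post_ids` are illustrative names:

```python
# Sketch only: POST a list of IDs to the Ensembl REST service.
import json
from urllib.request import Request, urlopen

SERVER = "http://rest.ensembl.org"
ENDPOINT = "/lookup/id"  # assumption: the slides do not say which endpoint


def build_post(ids, endpoint=ENDPOINT):
    """Build a POST request whose JSON body is {"ids": [...]}."""
    body = json.dumps({"ids": list(ids)}).encode("utf-8")
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    return Request(SERVER + endpoint, data=body, headers=headers, method="POST")


def post_ids(ids, endpoint=ENDPOINT):
    """Send the request and decode the JSON reply (needs network access)."""
    with urlopen(build_post(ids, endpoint)) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    request = build_post(["ENST00000000233", "ENST00000000412"])
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Keeping the request construction separate from the network call makes it easy to inspect the body and headers before anything is sent.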

REST summary

Skills needed: Understanding of features in Ensembl. Possibly programming in any language.
Scalability: With programming can query whole database.
Speed: Minimum query speed 150 ms. Time cost per datapoint.
Difficulty to query: Need to construct URLs. May need scripts to dissect data.
Long-term: The API takes database changes into account, so URLs do not need to change between releases.
Sequences? Yes

Webinar course feedback

We will send a SurveyMonkey feedback survey for this webinar series by e-mail:

PLEASE fill it out to tell us whether you have enjoyed and benefitted from the course!

Host an Ensembl course

Browser course

½-2 day course on the Ensembl browser, aimed at wet-lab scientists.

One trainer.

API course

1-4 day course on the Ensembl Perl API, aimed at bioinformaticians.

1-4 trainers.

http://www.ensembl.info/workshops/

We can teach an Ensembl course at your institute for free (except trainers’ expenses).

Email me: [email protected]

Help and documentation

Course online: http://www.ebi.ac.uk/training/online/subjects/11

Tutorials: www.ensembl.org/info/website/tutorials

Flash animations: www.youtube.com/user/EnsemblHelpdesk, http://u.youku.com/Ensemblhelpdesk

Email us [email protected]

Ensembl public mailing lists [email protected], [email protected]

Publications

Yates, A. et al. Ensembl 2016. Nucleic Acids Research. http://nar.oxfordjournals.org/content/early/2015/12/19/nar.gkv1157.full

Xosé M. Fernández-Suárez and Michael K. Schuster. Using the Ensembl Genome Server to Browse Genomic Sequence Data. Current Protocols in Bioinformatics 1.15.1-1.15.48 (2010). www.ncbi.nlm.nih.gov/pubmed/20521244

Giulietta M. Spudich and Xosé M. Fernández-Suárez. Touring Ensembl: A practical guide to genome browsing. BMC Genomics 11:295 (2010). www.biomedcentral.com/1471-2164/11/295

http://www.ensembl.org/info/about/publications.html

Follow us

www.facebook.com/Ensembl.org

@Ensembl www.ensembl.info

Acknowledgements
