The Ensembl online training series 2016
This webinar course
Date         Webinar topic                                                            Instructor
24th March   Introduction to Ensembl                                                  Emily Perry
31st March   Ensembl genes                                                            Denise Carvalho-Silva
7th April    Data export with BioMart                                                 Helen Sparrow
14th April   Variation data in Ensembl and the Ensembl VEP                            Denise Carvalho-Silva
21st April   Comparing genes and genomes with Ensembl Compara                         Helen Sparrow
28th April   Finding features that regulate genes – the Ensembl Regulatory Build      Emily Perry
5th May      Uploading your data to Ensembl and advanced ways to access Ensembl data  Ben Moore
Objectives
• What is Ensembl?
• What type of data can you get in Ensembl?
• How to navigate the Ensembl browser website.
• Where to go for help and documentation.
Structure
Presentation:
• What the data/tool is
• How we produce/process the data
Demo:
• Getting the data
• Using the tool
Exercises:
• On the Train online course
Questions?
• Ask questions in the Chat box in the webinar interface
• My Ensembl colleagues will respond during the webinar
• There’s no threading so please respond with @username
Helen Sparrow Emily Perry Denise Carvalho-Silva
Poll Questions
- Poll 1: Did you attend the previous webinars?
- Poll 2: Have you done the previous exercises?
Course exercises
http://www.ebi.ac.uk/training/online/course/ensembl-browser-webinar-series-2016
This text will be replaced by a YouTube video of the webinar (with a link to YouKu too) and a PDF of the slides.
The “next page” will be the exercises.
A link to the exercises and their solutions will appear in the page hierarchy.
Get help with the exercises
• Use the exercise solutions in the online course
• Join our Facebook group and discuss the exercises with everybody (see the online course for the link)
• Email us [email protected]
EBI is an Outstation of the European Molecular Biology Laboratory.
Custom Data and Advanced Data Access
Viewing your own data in Ensembl
Add custom tracks with your own data:
- BAM files
- GTF/GFF
- BED/BEDGraph files
- BigWig
- PSL
- VCF
- Pairwise interactions
http://www.ensembl.org/info/website/upload/index.html#formats
Hands on
We’re going to map large-scale deletions from patients with microcephaly and developmental delay by uploading BED files.
chr5 36821632 37091234 P1
chr5 36731476 36978306 P2
chr5 36908552 37108671 P3
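Before uploading, it can help to check what the three deletions have in common. The sketch below is an illustration only and not part of the webinar demo: it parses the BED lines shown above and reports the interval deleted in all three patients. The function names `parse_bed` and `shared_region` are made up for this example.

```python
# Parse the three BED lines from the slide and find the region deleted in all patients.
BED_TEXT = """\
chr5 36821632 37091234 P1
chr5 36731476 36978306 P2
chr5 36908552 37108671 P3
"""


def parse_bed(text):
    """Return a list of (chrom, start, end, name) tuples from whitespace-separated BED text."""
    records = []
    for line in text.splitlines():
        # Skip blank lines and the optional track/browser/comment lines
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, name = line.split()[:4]
        records.append((chrom, int(start), int(end), name))
    return records


def shared_region(records):
    """Return (chrom, start, end) common to all records, or None if there is none."""
    chroms = {record[0] for record in records}
    if len(chroms) != 1:
        return None
    start = max(record[1] for record in records)
    end = min(record[2] for record in records)
    return (chroms.pop(), start, end) if start < end else None


deletions = parse_bed(BED_TEXT)
overlap = shared_region(deletions)
print(overlap)
```

The shared interval is the largest start and the smallest end across the records, which is why no pairwise comparison is needed.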
Advanced data access
• Accessing data at different scales:
• Full database download from the FTP site
• Direct database access with MySQL
• Programmatic access with the Perl API
• Fast and flexible access with the REST API
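Access scales
[Figure: access methods arranged by the scale of query they suit, from one by one, through groups, to whole genome. Methods shown: Main browser, Mobile site, BioMart, REST API, VEP, Perl API, MySQL, FTP.]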
FTP
• Files of our complete database:
• Genomic, cDNA, CDS, ncRNA and protein sequence (FASTA)
• Annotated sequence (EMBL, GenBank)
• Gene sets (GTF, GFF)
• Whole-genome multiple and gene-based multiple alignments (MAF)
• Variants (VCF, GVF)
• Constrained elements (BED)
• Regulatory features (BED, BigWig)
• RNA-Seq files (BAM, BigWig)
• MySQL database
Access FTP
Your favourite FTP client
FTP downloads page: http://www.ensembl.org/info/data/ftp/index.html
FTP site: ftp://ftp.ensembl.org/pub/
FTP files are big
• Multiple MB or GB per file
• Lots of time to download/unzip
• Do you really need this data?
• Make sure it’s the right file before you download.
FTP site summary
Skills needed: Web-browsing or FTP client use. Handling and parsing file types.
Scalability: Whole database only.
Speed: Many minutes to download and decompress a file.
Difficulty to query: Files easy to download and decompress.
Long-term: New files with each release. File types stay the same.
Sequences? Yes
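"Handling and parsing file types" is the main skill the FTP route asks for. As an illustration, here is a minimal sketch of a FASTA reader that can also read a gzip-compressed download directly. The helper names are hypothetical and the toy input is invented for the example; it is not Ensembl data.

```python
import gzip
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open text handle in FASTA format."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_fasta_gz(path):
    """Read a gzip-compressed FASTA file without decompressing it on disk."""
    with gzip.open(path, "rt") as handle:
        yield from read_fasta(handle)


# Toy input, made up for illustration only
toy = io.StringIO(">seq1 toy record\nACGT\nTTGA\n>seq2\nGGCC\n")
records = list(read_fasta(toy))
print(records)
```

Reading through `gzip.open` avoids a separate decompression step, which matters when the files are several gigabytes.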
Ensembl data through MySQL
• Direct database querying using MySQL queries
• http://www.ensembl.org/info/data/mysql.html
mysql -u anonymous -h ensembldb.ensembl.org
mysql> use homo_sapiens_core_82_38;
mysql> SELECT gene.stable_id FROM gene, xref WHERE gene.display_xref_id = xref.xref_id AND xref.display_label LIKE 'brca2';
Ensembl data through MySQL
• I have an Ensembl gene ID - ENSG00000079950
• I want to get the EntrezGene ID for this gene
• We need to refer to the schema:
http://www.ensembl.org/info/docs/api/core/core_schema.html
The schema
Choose the external database: external_db.db_display_name = "EntrezGene"
Get the gene: gene.stable_id = "ENSG00000079950"
Get the xref ID: select xref.display_label
Choose our tables: from xref, object_xref, gene, external_db
Link the xref to the external database: external_db.external_db_id = xref.external_db_id
Link the object xref to the gene ID: object_xref.ensembl_id = gene.gene_id
Link the xref to the object xref: xref.xref_id = object_xref.xref_id
Specify you want gene xrefs: object_xref.ensembl_object_type = 'Gene'
Our query
/usr/local/mysql/bin/mysql -h ensembldb.ensembl.org -u anonymous -P 3306
use homo_sapiens_core_82_38;
select gene.stable_id, xref.display_label
from xref, object_xref, gene, external_db
where xref.xref_id = object_xref.xref_id
and object_xref.ensembl_id = gene.gene_id
and object_xref.ensembl_object_type = 'Gene'
and gene.stable_id = "ENSG00000079950"
and external_db.external_db_id = xref.external_db_id
and external_db.db_display_name = "EntrezGene";
Use port 3337 for GRCh37.
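The join logic of the query above can be tried offline before connecting to the public server. The sketch below builds a tiny in-memory SQLite database holding only the columns the query touches, which is a heavily simplified subset of the real core schema, and runs the same joins. Every row is a made-up placeholder apart from the stable ID taken from the slide, so the label returned is not a real EntrezGene ID. In the core schema the object type column sits on object_xref, so the filter is written against that table.

```python
import sqlite3

# In-memory toy database with only the columns used by the query
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (gene_id INTEGER, stable_id TEXT);
CREATE TABLE external_db (external_db_id INTEGER, db_display_name TEXT);
CREATE TABLE xref (xref_id INTEGER, external_db_id INTEGER, display_label TEXT);
CREATE TABLE object_xref (xref_id INTEGER, ensembl_id INTEGER, ensembl_object_type TEXT);

-- Toy rows: all numeric IDs and labels are placeholders, not Ensembl data
INSERT INTO gene VALUES (1, 'ENSG00000079950');
INSERT INTO external_db VALUES (10, 'EntrezGene'), (20, 'OtherDB');
INSERT INTO xref VALUES (100, 10, 'TOY_ENTREZ_ID'), (200, 20, 'TOY_OTHER_ID');
INSERT INTO object_xref VALUES (100, 1, 'Gene'), (200, 1, 'Gene');
""")

# Same joins as the slide, with the two search values passed as parameters
query = """
SELECT gene.stable_id, xref.display_label
FROM xref, object_xref, gene, external_db
WHERE xref.xref_id = object_xref.xref_id
  AND object_xref.ensembl_id = gene.gene_id
  AND object_xref.ensembl_object_type = 'Gene'
  AND external_db.external_db_id = xref.external_db_id
  AND gene.stable_id = ?
  AND external_db.db_display_name = ?
"""
rows = conn.execute(query, ("ENSG00000079950", "EntrezGene")).fetchall()
print(rows)
```

Passing the gene ID and database name as parameters keeps the query text reusable for other genes or external databases.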
MySQL summary
Skills needed: MySQL querying. Understanding of the schema.
Scalability: Can query whole genome.
Speed: Minimum query speed 9ms. Time cost for complexity of query.
Difficulty to query: Queries can get very complicated if extracting data from multiple tables and are often not reusable.
Long-term: The schema can change between releases.
Sequences? No
Ensembl data through the Perl API
• Database querying using Perl scripts
• We use object-oriented Perl
my $gene_adaptor = $registry->get_adaptor( 'human', 'core', 'gene' );
my $gene = $gene_adaptor->fetch_by_display_label( 'brca2' );
print $gene->stable_id, "\n";
http://www.ensembl.org/info/data/api.html
Perl API
• Learn Perl
• Download the API modules
• Learn the Ensembl API (download more modules)
• Write scripts
• Get out all possible Ensembl data. Output in any format you like.
Learn to use the API
EBI Train Online course: http://www.ebi.ac.uk/training/online/course/ensembl-filmed-api-workshop
API documentation: http://www.ensembl.org/info/docs/Doxygen/core-api/index.html
Ensembl data through the Perl API
• I want a script that gets a gene name from the command line and prints its sequence.
• We’ve already learnt how to use the API and know our way around the documentation
• We need to write a script.
Perl script
#!/usr/bin/perl
use strict;
use warnings;

# Load the Ensembl API registry
use Bio::EnsEMBL::Registry;
my $reg = "Bio::EnsEMBL::Registry";
$reg->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

# Get the gene name from the command line, and stop if none was given
my $gene_name = shift or die "Usage: $0 <gene_name>\n";

# Get the gene adaptor - this allows you to fetch genes from the database
my $gene_adaptor = $reg->get_adaptor('human', 'core', 'gene');

# Get genes using the gene adaptor
my @genes = @{ $gene_adaptor->fetch_all_by_external_name($gene_name) };

# Move through the genes one by one
foreach my $gene (@genes) {
    # Print the gene name, ID and sequence
    print "> ", $gene_name, " ", $gene->stable_id, "\n", $gene->seq, "\n";
}
Use port 3337 for GRCh37.
Perl API summary
Skills needed: Programming in Perl. Understanding of features in Ensembl.
Scalability: Can query whole database.
Speed: 1s for start-up plus minimum query speed 50ms. Time cost per datapoint.
Difficulty to query: Scripts needed, but these can be easily reused and adapted. API links data easily.
Long-term: The API takes database changes into account, so scripts do not need to change between releases.
Sequences? Yes
Data access via REST
• We’ve had a Perl API for a long time …
• … but not everybody works in Perl
• Our RESTful service allows language-agnostic access to our data.
• Visit rest.ensembl.org for installation, documentation and examples
What is REST?
• REST allows you to query the database using simple URLs giving output in plain text format
e.g. http://rest.ensembl.org/xrefs/symbol/homo_sapiens/BRCA2?content-type=application/json
gives [{"type":"gene","id":"ENSG00000139618"},{"type":"gene","id":"LRG_293"}]
• This means you can write scripts in any language to construct these URLs and read their output
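As a small illustration of "construct these URLs and read their output", the sketch below builds the xrefs/symbol URL shown above and decodes the JSON reply quoted on the slide. It does not contact the server: the reply is pasted in as a string so the example runs offline. The function names are made up for this example.

```python
import json
from urllib.parse import quote

SERVER = "http://rest.ensembl.org"


def xref_symbol_url(symbol, species="homo_sapiens"):
    """Build the xrefs/symbol URL for a gene symbol, asking for JSON output."""
    return f"{SERVER}/xrefs/symbol/{species}/{quote(symbol)}?content-type=application/json"


def ids_of_type(json_text, wanted_type="gene"):
    """Return the IDs of all hits of the wanted type from an xrefs/symbol reply."""
    return [hit["id"] for hit in json.loads(json_text) if hit["type"] == wanted_type]


# Reply copied from the slide, used here instead of a live request
SAMPLE_REPLY = '[{"type":"gene","id":"ENSG00000139618"},{"type":"gene","id":"LRG_293"}]'

url = xref_symbol_url("BRCA2")
gene_ids = ids_of_type(SAMPLE_REPLY)
print(url)
print(gene_ids)
```

Any language with a URL library and a JSON parser can do the same two steps, which is the point of the language-agnostic service.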
Single endpoint demo
• I want to get a gene sequence from an Ensembl gene ID
• I need to use the docs to find an appropriate endpoint: http://rest.ensembl.org/
http://rest.ensembl.org/sequence/id/ENSG00000157764
Use grch37.rest.ensembl.org for GRCh37.
Scripting demo
• I want a script that gets a gene name from the command line and prints its sequence.
• There’s no one endpoint that does this action, so I have to combine two endpoints with a script
Python script
#!/usr/bin/env python3
# Get the modules needed for the script
import json
import sys

import httplib2

http = httplib2.Http(".cache")
headers = {"Content-Type": "application/json"}

# Get the gene name from the command line, and stop if none was given
if len(sys.argv) != 2:
    sys.exit(f"Usage: {sys.argv[0]} <gene_name>")
gene_name = sys.argv[1]

# Define the general URL parameters
server = "http://rest.ensembl.org"

# Define the REST query to get the gene ID from the gene name
ext_get_gene = "/xrefs/symbol/homo_sapiens/" + gene_name + "?"

# Submit the query
resp, get_genes = http.request(server + ext_get_gene, method="GET", headers=headers)

# Decode the JSON output
genes = json.loads(get_genes)

# Move through the genes one by one
for gene in genes:
    # Define the REST query to get the sequence from the gene
    ext_get_seq = "/sequence/id/" + gene["id"] + "?"

    # Submit the query
    resp, get_seq = http.request(server + ext_get_seq, method="GET", headers=headers)

    # Decode the JSON output
    seq = json.loads(get_seq)

    # Print the gene name, ID and sequence
    print(f"> {gene_name} {gene['id']}\n{seq['seq']}")
POST demo
• Some endpoints can perform multiple queries at once using POST
• Use Postman https://www.getpostman.com/
POST demo
Choose Body
Input IDs in JSON
{ "ids" : ["ENST00000000233", "ENST00000000412", "ENST00000000442", "ENST00000001008", "ENST00000001146", "ENST00000002125", "ENST00000002165", "ENST00000002501", "ENST00000002596", "ENST00000002829", "ENST00000003084", "ENST00000003100", "ENST00000003302", "ENST00000003583", "ENST00000003912", "ENST00000004103", "ENST00000004531", "ENST00000004982", "ENST00000005082", "ENST00000005178", "ENST00000005180", "ENST00000005226", "ENST00000005257", "ENST00000005259", "ENST00000005260", "ENST00000005284", "ENST00000005286", "ENST00000005340", "ENST00000005374", "ENST00000005386", "ENST00000005558", "ENST00000005756", "ENST00000005995", "ENST00000006015", "ENST00000006053", "ENST00000006251", "ENST00000006275", "ENST00000006658", "ENST00000006724", "ENST00000006750", "ENST00000006777", "ENST00000007264", "ENST00000007390", "ENST00000007414", "ENST00000007510", "ENST00000007516", "ENST00000007699", "ENST00000007708", "ENST00000007722", "ENST00000007735" ] }
POST demo
Choose Headers
Click on the pencil
Add Content-Type, then application/json by typing the first few letters and selecting it
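The same POST request can be prepared from a script instead of Postman. This is a sketch under assumptions: the slides do not name the endpoint used in the demo, so `/lookup/id` is chosen here for illustration, and the helper names are made up. Building the request needs no network access; only `send` contacts the server.

```python
import json
import urllib.request


def build_post_request(ids, server="http://rest.ensembl.org", endpoint="/lookup/id"):
    """Prepare a POST request carrying a list of IDs as a JSON body.

    The endpoint is an assumption for this example; check the REST docs
    for the endpoint that matches the data you want.
    """
    body = json.dumps({"ids": list(ids)}).encode("utf-8")
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    return urllib.request.Request(server + endpoint, data=body, headers=headers, method="POST")


def send(request):
    """Send the request and decode the JSON reply (needs network access)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Two of the transcript IDs from the slide
request = build_post_request(["ENST00000000233", "ENST00000000412"])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling `send(request)` would submit all the IDs in one round trip, which is the advantage of POST over one GET per ID.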
REST summary
Skills needed: Understanding of features in Ensembl. Possibly programming in any language.
Scalability: With programming can query whole database.
Speed: Minimum query speed 150ms. Time cost per datapoint.
Difficulty to query: Need to construct URLs. May need scripts to dissect data.
Long-term: The API takes database changes into account, so URLs do not need to change between releases.
Sequences? Yes
Webinar course feedback
We will send a SurveyMonkey feedback survey for this webinar series by e-mail:
PLEASE fill it out to tell us whether you have enjoyed and benefitted from the course!
Host an Ensembl course
Browser course
½-2 day course on the Ensembl browser, aimed at wet-lab scientists.
One trainer.
API course
1-4 day course on the Ensembl Perl API, aimed at bioinformaticians.
1-4 trainers.
http://www.ensembl.info/workshops/
We can teach an Ensembl course at your institute for free (except trainers’ expenses).
Email me: [email protected]
Help and documentation
Course online: http://www.ebi.ac.uk/training/online/subjects/11
Tutorials: www.ensembl.org/info/website/tutorials
Flash animations: www.youtube.com/user/EnsemblHelpdesk, http://u.youku.com/Ensemblhelpdesk
Email us: [email protected]
Ensembl public mailing lists: [email protected], [email protected]
Publications
Yates, A. et al. Ensembl 2016. Nucleic Acids Research. http://nar.oxfordjournals.org/content/early/2015/12/19/nar.gkv1157.full
Xosé M. Fernández-Suárez and Michael K. Schuster. Using the Ensembl Genome Server to Browse Genomic Sequence Data. Current Protocols in Bioinformatics 1.15.1-1.15.48 (2010). www.ncbi.nlm.nih.gov/pubmed/20521244
Giulietta M Spudich and Xosé M Fernández-Suárez. Touring Ensembl: A practical guide to genome browsing. BMC Genomics 11:295 (2010). www.biomedcentral.com/1471-2164/11/295
http://www.ensembl.org/info/about/publications.html