publications - european bioinformatics institute · 6th sept variation data in ensembl and the...
TRANSCRIPT
Training materials
• Ensembl training materials are protected by a CC BY license
http://creativecommons.org/licenses/by/4.0/
• If you wish to re-use these materials, please credit Ensembl for their creation
• If you use Ensembl for your work, please cite our papers
http://www.ensembl.org/info/about/publications.html
Questions?
○ We’ve muted all of your microphones
○ Join our Slack workspace and ask questions (link in your registration confirmation email)
○ My Ensembl colleagues will respond during the talk
○ Please use @username to reply to a specific person
Emily Perry Astrid Gall
Course exercises
All materials and exercises located here:
http://www.ebi.ac.uk/training/online/course/ensembl-browser-webinar-series-2016
A link to exercises and their solutions will appear in the page hierarchy
Get help with the exercises
• Use the exercise solutions in the online course
• Join our Slack workspace and discuss the exercises with everybody in dedicated channels (register to get sent a link)
• Email us [email protected]
This webinar course
Date | Webinar topic | Instructor
4th Sept | Introduction to Ensembl ✔ | Astrid Gall
4th Sept | Ensembl genes ✔ | Emily Perry
6th Sept | Variation data in Ensembl and the Ensembl VEP | Erin Haskell
6th Sept | Comparing genes and genomes with Ensembl Compara | Astrid Gall
11th Sept | Finding features that regulate genes – the Ensembl Regulatory Build | Emily Perry
11th Sept | Data export with BioMart | Erin Haskell
13th Sept | Uploading your data to Ensembl | Astrid Gall
13th Sept | Introduction to the Ensembl REST APIs | Emily Perry
Session structure
Presentation:
Part 1: Ensembl variation data
Part 2: The Ensembl Variant Effect Predictor (VEP)
Demo:
Part 1: Viewing variation data in the browser
Part 2: Using the VEP
Exercises:
Available on the train online site
Session Overview
Ensembl variation data
- What types of variants are in Ensembl?
- Where does the data come from?
- What are the biological consequences of variants?
- Things to watch out for
The Ensembl Variant Effect Predictor (VEP) tool
- What data can I use with the VEP?
- Identifying known variants
- Predicting consequences for novel variants
What types of variant are in Ensembl?
ensembl.org/info/genome/variation/index.html
Two broad categories:
1. Sequence variants (small alterations ≤50 bp)
2. Structural variants (larger alterations >50 bp)
Variant type 1: Sequence variants
● Single nucleotide polymorphisms (SNP/SNV)
ref    ...TTGACGTA...
alt    ...TTGGCGTA...
● Small insertions & deletions
ref    ...TTGACGTA...
ins    ...TTGAGCGTA...
del    ...TTG-CGTA...
indel  ...TTGGCTCGTA...
http://www.ensembl.org/info/genome/variation/prediction/classification.html
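As a rough illustration of the classes above, the following Python sketch labels a reference/alternate allele pair by comparing allele lengths. The function name `classify_sequence_variant` and the use of `-` for an empty allele are assumptions made for this example. It is a simplification and not Ensembl's own classification code.

```python
def classify_sequence_variant(ref: str, alt: str) -> str:
    """Label a ref/alt allele pair as SNV, insertion, deletion or indel.

    Simplified sketch: '-' (or an empty string) stands for "no bases".
    Equal-length multi-base changes are lumped into 'indel' here.
    """
    ref = "" if ref == "-" else ref.upper()
    alt = "" if alt == "-" else alt.upper()
    if ref == alt:
        raise ValueError("ref and alt are identical; not a variant")
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if not ref:
        return "insertion"
    if not alt:
        return "deletion"
    return "indel"


if __name__ == "__main__":
    # Alleles taken from the slide examples, with TTGACGTA as the reference.
    print(classify_sequence_variant("A", "G"))    # TTGACGTA vs TTGGCGTA
    print(classify_sequence_variant("-", "G"))    # TTGACGTA vs TTGAGCGTA
    print(classify_sequence_variant("A", "-"))    # TTGACGTA vs TTG-CGTA
    print(classify_sequence_variant("A", "GCT"))  # TTGACGTA vs TTGGCTCGTA
```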
Variant type 2: Structural variants
● Copy number variation (CNV)
● Inversion - nucleotide sequence inverted at same position
● Translocation - nucleotide sequence moved to a new position
[Figure: reference sequence compared with copy number gain and loss, an inversion, and translocation to the same or a different chromosome]
http://www.ensembl.org/info/genome/variation/prediction/classification.html
Where does the data come from?
The Ensembl variation process
Steps: Variant import, Quality control, Linked data, Ensembl analysis
Ensembl variation process: Import
Import variant data from publicly available archives and data repositories.
http://www.ensembl.org/info/genome/variation/species/sources_documentation.html
EVA
...and many many more
Data import: 23 species with variation data
http://www.ensembl.org/info/genome/variation/species/species_data_types.html
Data import: 27 species with variation data
http://ensemblgenomes.org/info/genomes?variation=1
Division | Number of species with variation data
Bacteria | 0
Fungi | 8
Metazoa | 4
Plants | 12
Protists | 3
Ensembl variation process: QC
● Mapping to reference assembly
○ GRCh37 GRCh38
● Checks on alleles
● Checks for IUPAC ambiguity codes
● Excluding ‘suspect’ variants
http://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#quality_control
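To make the allele checks above more concrete, here is a minimal Python sketch of one of them: spotting IUPAC ambiguity codes in an allele string. The function name `find_ambiguity_codes` and the `/`-separated allele string are assumptions for illustration. The slides do not describe how Ensembl implements this check.

```python
from typing import List

# IUPAC nucleotide ambiguity codes (everything beyond the plain bases A, C, G, T).
IUPAC_AMBIGUITY_CODES = set("RYSWKMBDHVN")


def find_ambiguity_codes(allele_string: str) -> List[str]:
    """Return the ambiguity codes found in an allele string such as 'A/R'."""
    found: List[str] = []
    for allele in allele_string.upper().split("/"):
        for base in allele:
            if base in IUPAC_AMBIGUITY_CODES and base not in found:
                found.append(base)
    return found


if __name__ == "__main__":
    for alleles in ("A/G", "A/R", "N/-"):
        codes = find_ambiguity_codes(alleles)
        status = "contains ambiguity codes " + ",".join(codes) if codes else "ok"
        print(alleles, status)
```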
http://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
Ensembl variation process: Linked data
Import ‘accessory’ data
● Phenotype/disease
● Allele frequencies
● Publication data
Linked data: 1000 Genomes Project
http://www.internationalgenome.org
Sequencing 2,500 individuals at 4X coverage
[Figure: world map of sampled populations, grouped by region: America, Africa, Europe, East Asia, Central-South Asia]
Population codes: CEU, CHB, JPT, LWK, MSL, ASW, YRI, TSI, MXL, GIH, PUR, CLM, PEL, ACB, GWD, IBS, GBR, FIN, CHS, KHV, CDX, PJL, BEB, ITU, STU, ESN
Linked data: gnomAD allele frequencies
The Genome Aggregation Database (gnomAD) provides allele frequency data from 7 different populations
macarthurlab.org/2017/02/27/the-genome-aggregation-database-gnomad/
[Figure: chart with axis "Sample number"]
Ensembl variation process: Analysis
Ensembl predicts:
● Variant consequences
● Protein function prediction
● Linkage disequilibrium data
● Variant conservation across species
http://www.ensembl.org/info/genome/variation/prediction/index.html
http://www.ensembl.org/info/genome/variation/prediction/predicted_data.html
Analysis: Variant consequence terms
Standardised variant consequence terms as defined by
http://www.sequenceontology.org
http://www.ensembl.org/info/genome/variation/prediction/predicted_data.html
Analysis: Pathogenicity scores
- For missense variants only
- Two prediction algorithms:
  - SIFT (Sorting Intolerant From Tolerant)
  - PolyPhen (Polymorphism Phenotyping)
Score changes in amino acid sequence based on:
- How conserved the amino acid is
- The chemical change in the amino acid
ensembl.org/info/genome/variation/predicted_data.html#sift
Analysis: Pathogenicity scores
[Figure: SIFT and PolyPhen score scales, each running from 0 to 1]
SIFT: Deleterious below 0.05, Tolerated above
PolyPhen: Benign, Possibly damaging, Probably damaging (axis marks at 0.1 and 0.2)
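The SIFT scale can be read as a simple threshold rule. The sketch below assumes the 0.05 cut-off shown on the slide. The function name `sift_prediction` and the choice to call a score of exactly 0.05 "tolerated" are assumptions for this example, so check the Ensembl documentation for the exact boundary convention.

```python
def sift_prediction(score: float, threshold: float = 0.05) -> str:
    """Map a SIFT score (0 to 1) to a qualitative prediction.

    Assumption: scores below the threshold are 'deleterious',
    everything else is 'tolerated'.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores are expected to lie between 0 and 1")
    return "deleterious" if score < threshold else "tolerated"


if __name__ == "__main__":
    for score in (0.0, 0.01, 0.5, 1.0):
        print(score, sift_prediction(score))
```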
Analysis: Linkage disequilibrium
Linkage Disequilibrium (LD)
“the non-random association of
alleles at 2 or more loci within a given
population”
or
“how often two variants or specific
sequences are inherited together”
Analysis: Linkage disequilibrium
The Linkage Disequilibrium (LD) calculator
Within a genomic region...
For a list of variants...
For a defined area surrounding your variant...
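To illustrate what "non-random association of alleles" means numerically, here is a small Python sketch of the textbook LD statistics D and r² for two biallelic loci, computed from phased haplotypes. The function name, the tuple-based haplotype representation and the toy data are assumptions for this example. The slides do not describe how the Ensembl LD calculator is implemented.

```python
from typing import Iterable, Tuple


def linkage_disequilibrium(
    haplotypes: Iterable[Tuple[str, str]], allele_a: str, allele_b: str
) -> Tuple[float, float]:
    """Return (D, r_squared) for allele_a at locus 1 and allele_b at locus 2.

    Each haplotype is a pair: (allele at locus 1, allele at locus 2).
    D = p_AB - p_A * p_B
    r^2 = D^2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
    """
    haps = list(haplotypes)
    n = len(haps)
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_a = sum(1 for h in haps if h[0] == allele_a) / n
    p_b = sum(1 for h in haps if h[1] == allele_b) / n
    p_ab = sum(1 for h in haps if h == (allele_a, allele_b)) / n
    d = p_ab - p_a * p_b
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d, d * d / denominator


if __name__ == "__main__":
    # Toy data: alleles always travel together (complete LD).
    linked = [("A", "T"), ("A", "T"), ("G", "C"), ("G", "C")]
    # Toy data: every combination is equally common (no LD).
    unlinked = [("A", "T"), ("A", "C"), ("G", "T"), ("G", "C")]
    print(linkage_disequilibrium(linked, "A", "T"))
    print(linkage_disequilibrium(unlinked, "A", "T"))
```

An r² near 1 means the two variants are almost always inherited together, and an r² near 0 means knowing one allele says little about the other.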
Where can I find this data?
● Website www.ensembl.org
● Variant Effect Predictor (VEP)
● BioMart
● Programmatically:
○ Perl API (including VEP)
○ REST API
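As a hedged sketch of programmatic access, the snippet below builds a request to the Ensembl REST API `variation/:species/:id` endpoint using only the Python standard library. The helper names `variation_url` and `fetch_variant` are made up for this example, and the structure of the returned JSON is not assumed here. Running `fetch_variant` needs network access, so the demo only prints the URL.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

REST_SERVER = "https://rest.ensembl.org"


def variation_url(species: str, variant_id: str) -> str:
    """Build the URL for the REST 'variation' endpoint."""
    return f"{REST_SERVER}/variation/{quote(species)}/{quote(variant_id)}"


def fetch_variant(species: str, variant_id: str) -> dict:
    """Fetch a variant record as JSON (requires network access)."""
    request = Request(
        variation_url(species, variant_id),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # rs4988235 is the variant used in the browser demonstration.
    print(variation_url("human", "rs4988235"))
```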
Note: Reference & alternate alleles
[Figure: contigs BL102, AL476, CM553 and IM768; the sequence below is from AL476]
AGTCGTAGCTAGCAAGGCCATAGGCGA
Frequency A = 0.01, frequency G = 0.99
G is the ancestral allele
A causes disease susceptibility
A is the allele in the contig used
∴ A is the reference allele
∴ G is the alternate allele
∴ Alleles are A/G
http://www.ensembl.org/Homo_sapiens/Variation/Population?db=core;r=12:120999079-121000079;v=rs1169305;vdb=variation;vf=829489
Note: Allele strand
+ve strand: AGTCGTAGCTAGCT/GAGGCCATAGGCGA
-ve strand: TCGCCTATGGCCTA/CGCTAGCTACGACT
Exon sequence: TATGGCCTA/CGCTAGC
Alleles in database = T/G
Alleles in gene = A/C
Alleles = A/C -ve strand or T/G +ve strand
Alleles = A/C or T/G
Often lack further info
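The strand relationship on this slide can be checked with a short Python sketch that reverse-complements a sequence and flips an allele string between strands. The function names are assumptions for this example. The sequence and alleles are the ones shown on the slide.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def flip_allele_string(allele_string: str) -> str:
    """Convert an allele string such as 'T/G' to the opposite strand."""
    return "/".join(reverse_complement(a) for a in allele_string.split("/"))


if __name__ == "__main__":
    forward = "AGTCGTAGCTAGCTAGGCCATAGGCGA"  # +ve strand, carrying the T allele
    print(reverse_complement(forward))       # the same locus on the -ve strand
    print(flip_allele_string("T/G"))         # alleles as seen on the -ve strand
```

When a publication reports alleles without a strand, comparing both the given allele string and its flipped form against the reference is one way to work out which strand was meant.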
Demonstration
- Finding variants in a gene of interest, MCM6
- Finding variants at a genomic location of interest
- Finding out more information about a specific variant, rs4988235
The Variant Effect Predictor
McLaren et al 2016 europepmc.org/abstract/MED/27268795
What does the VEP do?
A tool to predict and annotate the functional consequences of variants
From your variant data, the VEP reports:
• Affected gene, transcript and protein sequence
• Splicing consequences
• Regulatory consequences
• Known variants:
+ Pathogenicity
+ Frequency data
+ Literature citations
Variant data input formats
Variant coordinates (Ensembl default)
1 881907 881906 -/C +
5 140532 140532 T/C +
12 1017956 1017956 T/A +
2 946507 946507 G/C +
14 19584687 19584687 C/T -
HGVS notation
ENST00000285667.3:c.1047_1048insC
5:g.140532T>C
NM_153681.2:c.7C>T
ENSP00000439902.1:p.Ala2233Asp
NP_000050.2:p.Ile2285Val
VCF
#CHROM POS ID REF ALT
20 14370 rs6054257 G A
20 17330 . T A
20 1110696 rs6040355 A G,T
20 1230237 . T .
Variant IDs
rs41293501
COSM327779
rs146120136
FANCD1:c.475G>A
rs373400041
http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#input
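The Ensembl default format is whitespace-separated: chromosome, start, end, allele string and strand. The Python sketch below parses one such line. The class and function names are made up for this example, and only the `+`/`-` strand notation shown on the slide is accepted. It also flags insertions, which in this format are written with start = end + 1, as in the first example line.

```python
from dataclasses import dataclass


@dataclass
class VepInputVariant:
    chromosome: str
    start: int
    end: int
    allele_string: str
    strand: str

    @property
    def is_insertion(self) -> bool:
        # In the default format an insertion has start = end + 1.
        return self.start == self.end + 1


def parse_default_line(line: str) -> VepInputVariant:
    """Parse one line of Ensembl default VEP input (first five columns)."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns, got {len(fields)}")
    chromosome, start, end, alleles, strand = fields[:5]
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    return VepInputVariant(chromosome, int(start), int(end), alleles, strand)


if __name__ == "__main__":
    for text in ("1 881907 881906 -/C +", "14 19584687 19584687 C/T -"):
        variant = parse_default_line(text)
        print(variant, "insertion" if variant.is_insertion else "not an insertion")
```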
VEP features: finding known variants
Are your variants already known?
○ dbSNP
○ COSMIC
○ ClinVar
○ ESP
○ HGMD-Public
○ Phencode
How common are your variant alleles in different populations?
○ 1000 Genomes
○ ESP
○ ExAC projects
○ gnomAD
Phenotype/disease, clinical significance
○ OMIM
○ Orphanet
○ GWAS catalog
○ ClinVar
VEP features: consequence prediction
Consequence predictions (choose multiple databases)
○ Ensembl
○ RefSeq
○ Merged
○ GENCODE basic
Does your variant overlap regulatory regions?
○ ENCODE
○ BLUEPRINT
○ NIH Epigenomics Roadmap
○ Can be limited to regulatory regions observed in specific cell types.
Pathogenicity predictions
○ SIFT
○ PolyPhen
○ via plugins: CADD, FATHMM, LRT, MutationTaster, and many more!
Plugin info: http://www.ensembl.info/ecode/category/vep-plugins/
VEP features: plugins
Plugin info: http://www.ensembl.info/ecode/category/vep-plugins/
● Plugins add extra functionality to the VEP
● They may extend, filter or manipulate the output of the VEP.
● Plugins may make use of external data or code.
● Available on the web tool and with the script.
Use VEP with any species
http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html
● Access through the web browser, REST API or Perl API
● Use prebuilt caches for Ensembl species.
...and for all species in
● Speed up your VEP script with an offline cache.
● Or make your own from GTF and FASTA files - even for genomes not in Ensembl.
Using VEP
ensembl.org/info/docs/tools/vep/index.html
Demonstration
We have identified four variants on human chromosome nine:
- A deletion at 128328461
- C->A at 128322349
- C->G at 128323079
- G->A at 128322917
We will use the Ensembl VEP to find out:
- Are any of my variants already known?
- What genes are affected by my variants?
- Do any of my variants affect gene regulation?
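One way to prepare the three single-base changes from this demonstration for the VEP is to write them in the Ensembl default input format. The sketch below is a minimal example with assumptions: the helper name `snv_to_vep_line` is made up, the chromosome is written as `9`, and the forward strand is assumed. The deletion is left out because the slide does not state which base or bases are deleted.

```python
from typing import List, Tuple


def snv_to_vep_line(
    chromosome: str, position: int, ref: str, alt: str, strand: str = "+"
) -> str:
    """Format a single-nucleotide change as an Ensembl default input line.

    For an SNV the start and end coordinates are the same position.
    """
    return f"{chromosome} {position} {position} {ref}/{alt} {strand}"


# The three single-base changes listed for the demonstration.
DEMO_SNVS: List[Tuple[str, int, str, str]] = [
    ("9", 128322349, "C", "A"),
    ("9", 128323079, "C", "G"),
    ("9", 128322917, "G", "A"),
]

if __name__ == "__main__":
    for chromosome, position, ref, alt in DEMO_SNVS:
        print(snv_to_vep_line(chromosome, position, ref, alt))
```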
This webinar course
Date | Webinar topic | Instructor
4th Sept | Introduction to Ensembl ✔ | Astrid Gall
4th Sept | Ensembl genes ✔ | Emily Perry
6th Sept | Variation data in Ensembl and the Ensembl VEP ✔ | Erin Haskell
6th Sept | Comparing genes and genomes with Ensembl Compara | Astrid Gall
11th Sept | Finding features that regulate genes – the Ensembl Regulatory Build | Emily Perry
11th Sept | Data export with BioMart | Erin Haskell
13th Sept | Uploading your data to Ensembl | Astrid Gall
13th Sept | Introduction to the Ensembl REST APIs | Emily Perry
Coming up!
Comparing genes and genomes with Ensembl Compara
Ensembl Compara allows you to perform detailed analysis of
gene models between species.
During this webinar we take a look at the gene trees and
homologues of a set of genes, and at whole genome alignments
between pairs and groups of species.
Starting in ∼5 minutes! Astrid Gall