
From Variants to Pathways: Agilent GeneSpring GX’s Variant Analysis Workflow

Technical Overview

Introduction

Next-generation sequencing (NGS) studies have created unanticipated challenges for data mining and data storage, because a single sequencing project can report large numbers of genetic variants. The scientific community has access to a plethora of tools for analyzing these data, but combining them to obtain biologically meaningful results is still challenging. While primary and secondary analysis can be automated, tertiary data exploration is largely done manually by a researcher (Figure 1). The starting point for tertiary analysis is the list of mutations identified in the secondary analysis. This information is usually stored in Variant Call Format (VCF) files. VCF has become an important format in modern biology because it is widely used to report variants. VCF files are flexible and can store all variant types, including single nucleotide variants, insertions and deletions, copy number variants, and structural variants.
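To make the variant types mentioned above concrete, the following is a minimal Python sketch that reads one VCF data line and assigns a rough type from the REF and ALT columns. The column order (CHROM, POS, ID, REF, ALT) follows the VCF layout; the `Variant` type, the function names, and the length-based classification rule are assumptions made for illustration and are not GeneSpring GX's own logic.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int   # 1-based position, as in the VCF POS column
    ref: str
    alt: str


def parse_vcf_line(line: str) -> list[Variant]:
    """Split one VCF data line into one Variant per ALT allele."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alts = fields[:5]
    return [Variant(chrom, int(pos), ref, alt) for alt in alts.split(",")]


def classify(variant: Variant) -> str:
    """Hypothetical rule of thumb based only on the REF and ALT strings."""
    ref, alt = variant.ref, variant.alt
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "structural"      # symbolic or breakend ALT allele
    if len(ref) == len(alt):
        return "substitution"
    if len(ref) < len(alt) and alt.startswith(ref):
        return "insertion"
    if len(ref) > len(alt) and ref.startswith(alt):
        return "deletion"
    return "complex"


# Made-up record with two ALT alleles.
record = "chr1\t12345\trs1\tA\tG,AT\t50\tPASS\t."
for v in parse_vcf_line(record):
    print(v.chrom, v.pos, v.ref, v.alt, classify(v))
```

A production parser would also handle header lines, missing values, and per-sample genotype columns; this sketch only shows how the variant type can be read off a record.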

Primary analysis
• Production of sequence data and reads

Secondary analysis
• Alignment
• QC
• Variant calling on aligned data

Tertiary analysis
• Annotation and filtering of variants
• Genome browser-driven exploration
• Biological contextualization

Agilent GeneSpring GX

Figure 1. NGS analysis can broadly be divided into three parts. Primary and secondary analyses are computationally intensive and are usually automated. Tertiary analysis is the exploration of biologically relevant data.

GeneSpring GX now includes a variant analysis workflow that allows users to sort and compare VCF files, identify genes affected by a variation, and perform pathway analysis on affected genes. The workflow includes the steps in Figure 2.

Figure 2. The variant analysis workflow in Agilent GeneSpring GX allows users to import a list of SNPs for tertiary data analysis.

Import VCF

Filter and sort variants

Translate regions to genes

Annotate and compare regions

Identify genic regions

Gene ontology analysis

Pathway analysis


Key Functionalities and Benefits

• Supports processed NGS data with variant call information in VCF format

• Enables simultaneous filtering of variants based on the variant-associated information in the VCF file

• Supports public and commercial databases including ClinVar, COSMIC, dbNSFP, and 1000 Genomes; this information can be used for visualization and further analysis

• Provides powerful visualization options, including an elastic genome browser for interactive querying of specific variants

• Performs multi-omic and inter-genomic analysis using various tools, including pathway analysis and correlation analysis

Figure 3. Agilent GeneSpring GX main view, showing the genome browser with its data and annotation tracks. Any track can be selected to display data as a spreadsheet.

Mutations are color-coded based on subtypes for easy visualization.

Data derived from the VCF analysis can be visualized as separate or merged tracks. Read coverage is plotted on the Y-axis.

Annotation files (for example, TargetScan or CpG Islands) help in understanding the effect of a mutation on transcripts.

Spreadsheet view of the VCF file, which can be sorted and copied to the clipboard.

Importing and Viewing VCF Data

This workflow supports VCF files that are exported from tools and portals such as 1000 Genomes (http://www.1000genomes.org/home), Agilent SureCall, and Strand NGS. VCF files can be compared to identify unique or common variants, and the results can be viewed in the genome browser. The variant analysis workflow in GeneSpring GX allows users to perform tertiary analysis by translating the effect of SNPs onto biological pathways and overlaying the data in a multi-omics experiment.

The user can determine the effect of variants (SNPs, insertions, deletions, copy number variations, or structural variants) on genes, transcripts, and regulatory regions. VCF files imported into GeneSpring GX are stored within the tool for analysis. Each VCF file is stored as a Region List upon data import. Region Lists can be viewed individually in the Genome Browser or as a spreadsheet with their corresponding annotations. The drag-and-drop feature of the tool allows viewing of results as well as annotations. Figure 3 shows the default view in a SNP analysis workflow. Analyses can easily be performed to identify variants common between VCFs, variants unique to a given VCF, and variants detected in all samples.
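The comparison of VCF files described above amounts to set operations on variant keys. The sketch below, in Python, assumes that two variants are "the same" when their CHROM, POS, REF, and ALT values match; that matching rule, the `variant_keys` helper, and the two tiny in-memory samples are illustrative assumptions, not a description of how GeneSpring GX compares Region Lists.

```python
VariantKey = tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)


def variant_keys(vcf_text: str) -> set[VariantKey]:
    """Collect one key per ALT allele from the data lines of a VCF."""
    keys: set[VariantKey] = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta-information, and the header
        chrom, pos, _id, ref, alts = line.split("\t")[:5]
        for alt in alts.split(","):
            keys.add((chrom, int(pos), ref, alt))
    return keys


# Two made-up samples sharing one variant.
HEADER = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
sample_a = HEADER + "chr1\t100\t.\tA\tG\t50\tPASS\t.\nchr1\t200\t.\tC\tT\t50\tPASS\t.\n"
sample_b = HEADER + "chr1\t100\t.\tA\tG\t40\tPASS\t.\nchr2\t300\t.\tG\tGA\t60\tPASS\t.\n"

a, b = variant_keys(sample_a), variant_keys(sample_b)
print("common:", sorted(a & b))
print("unique to A:", sorted(a - b))
print("unique to B:", sorted(b - a))
```

With more than two files, variants detected in all samples can be found by intersecting every key set, for example with `set.intersection(*key_sets)`.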


Genomic information is increasingly used in prognosis and research, which requires visualizing and analyzing thousands of individuals and millions of variants. The variant analysis workflow in GeneSpring GX allows users to cluster variants by zygosity score, allelic frequency, or any other value or tag that the VCF contains, across samples or VCF files. Figure 4 shows an example of a hierarchical tree that groups regions by a column value derived from the VCF file.

Variant filtering

The Region List Operations workflow offers the ability to filter variants and their associated data. These options include or exclude specific sites from any analysis performed by the program. For example, users can remove poor-quality variants and common polymorphisms, exclude particular genotypes, and divide SNPs into smaller lists that can be saved as region lists in the experiment navigator. GeneSpring GX also allows users to cluster a list of filtered regions. Filtered regions can be exported as a text, Browser Extensible Data (BED), or reference file.
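As an illustration of quality filtering followed by BED export, the Python sketch below keeps records whose QUAL value meets a threshold and whose FILTER column is `PASS` or missing, then writes them as BED intervals. The threshold of 30, the function names, the `LowQual` tag, and the sample records are assumptions chosen for the example; the coordinate conversion reflects the fact that VCF positions are 1-based while BED intervals are 0-based and half-open.

```python
def passes_filter(fields: list[str], min_qual: float = 30.0) -> bool:
    """Keep a record if QUAL is high enough and FILTER is PASS or unset."""
    qual, filter_flag = fields[5], fields[6]
    if qual == ".":
        return False  # no quality reported; treated as failing here
    return float(qual) >= min_qual and filter_flag in ("PASS", ".")


def to_bed_line(fields: list[str]) -> str:
    """Convert a 1-based VCF position to a 0-based, half-open BED interval."""
    chrom, pos, ref = fields[0], int(fields[1]), fields[3]
    start = pos - 1
    return f"{chrom}\t{start}\t{start + len(ref)}"


def filter_to_bed(vcf_lines: list[str], min_qual: float = 30.0) -> list[str]:
    bed = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if passes_filter(fields, min_qual):
            bed.append(to_bed_line(fields))
    return bed


# Made-up records: one passes, one has low QUAL, one has a failing FILTER tag.
lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tAT\tA\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT\t10\tPASS\t.",
    "chr2\t300\t.\tG\tA\t99\tLowQual\t.",
]
print("\n".join(filter_to_bed(lines)))
```

Filtering on INFO tags such as allele frequency would follow the same pattern, with the tag parsed from the eighth column.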

Figure 4. A) Hierarchical tree showing 39,912 clustered regions; B) a zoomed-in view. Columns are labeled using the default VCF file columns on the left, and the labels on the top show the variant types. The figure legend shows the color code used for the labels. The color range is determined by the column used to cluster the regions.

Figure 4 legend. Region color by variant type: Complex, Deletion, Insertion, Substitution. Color range: -126 to 126.


Adding and Updating Publicly Available Annotations

Public annotation databases are available for download from the Annotations Manager, as shown in Figure 5. VCF and BED files that list filtered and ranked variants can be saved in the Annotations Manager for a specific model organism. Data can be loaded either from the Agilent server or from the local desktop. This information can then be used to compare lists of mutations with annotated mutations derived from public sources (Figure 6), and the results can be viewed in the Genome Browser. Annotate Region List appends additional information from another Region List in the experiment or from annotation databases such as DNase clusters and GENCODE genes. The Import Region List utility allows the user to import region-based annotations that can be curated to obtain filtered regions for downstream processing.

Figure 6. Agilent GeneSpring GX compares a source region list with a region list of choice in one of two ways: by finding overlapping regions, or by specifying the maximum distance X (in bp) within which two regions are considered close to each other.
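The two comparison modes in Figure 6 can be sketched in Python as follows. Regions are represented here as hypothetical `(chrom, start, end)` tuples with 0-based, half-open coordinates; the function names and the way distance is measured (gap between the nearest edges) are assumptions for illustration, not the tool's documented behavior.

```python
from typing import Optional

Region = tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def gap(a: Region, b: Region) -> Optional[int]:
    """Signed gap in bp between two regions; negative when they overlap.

    Returns None when the regions lie on different chromosomes.
    """
    if a[0] != b[0]:
        return None
    return max(a[1], b[1]) - min(a[2], b[2])


def matching_regions(source: list[Region], targets: list[Region],
                     max_distance: Optional[int] = None) -> list[Region]:
    """Source regions with at least one partner in targets.

    With max_distance=None only true overlaps count; otherwise any target
    whose nearest edge lies within max_distance bp also counts.
    """
    limit = -1 if max_distance is None else max_distance
    result = []
    for s in source:
        for t in targets:
            g = gap(s, t)
            if g is not None and g <= limit:
                result.append(s)
                break
    return result


source = [("chr1", 100, 101), ("chr1", 500, 501), ("chr2", 100, 101)]
targets = [("chr1", 90, 110), ("chr1", 520, 600)]
print(matching_regions(source, targets))                   # overlap mode
print(matching_regions(source, targets, max_distance=50))  # distance mode
```

The nested loop is quadratic and is only meant to show the logic; for lists with many thousands of regions, sorting by position or using an interval tree would be the usual choice.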

Figure 5. Annotations Manager can store multiple builds for a given organism. Annotations for more than 30 different model organisms are available on the Agilent server for download, and custom annotations can be added for a specific build of a model organism.


Results Interpretation

For biological interpretation and contextualization of results, GeneSpring GX provides the following options:

• Gene Ontology (GO)

• Pathway Analysis

• Multi-Omic Analysis

To identify genes and transcripts in a genomic region, GeneSpring GX takes a set of genome coordinates and retrieves a list of genes using Translate Regions To Genes. A desired flanking region can be set in the workflow. The result of this analysis is a list of genes that lie within a set distance of the regions in the selected Region List (5,000 bp by default). For each gene, Find Genic Parts identifies exonic, intronic, upstream, and downstream regions based on a user-selected transcript model (RefSeq, Ensembl, or UCSC).
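The idea behind translating regions to genes and labeling genic parts can be sketched in Python as below. The 5,000 bp default flank comes from the text above; the `Gene` record, the function names, the strand handling, and the example gene `GENE_A` are hypothetical and only illustrate one plausible way to implement the lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int    # 0-based, inclusive
    end: int      # exclusive
    strand: str   # "+" or "-"
    exons: tuple[tuple[int, int], ...]


def translate_regions_to_genes(regions, genes, flank=5000):
    """Pair each region with every gene whose span, extended by flank, it touches."""
    hits = []
    for chrom, start, end in regions:
        for g in genes:
            if g.chrom == chrom and start < g.end + flank and end > g.start - flank:
                hits.append(((chrom, start, end), g))
    return hits


def genic_part(pos: int, gene: Gene) -> str:
    """Label a position relative to a gene model."""
    if gene.start <= pos < gene.end:
        in_exon = any(s <= pos < e for s, e in gene.exons)
        return "exonic" if in_exon else "intronic"
    before = pos < gene.start
    if gene.strand == "-":
        before = not before  # upstream is at higher coordinates on the minus strand
    return "upstream" if before else "downstream"


# Made-up gene model and single-base regions.
gene_a = Gene("GENE_A", "chr1", 10000, 20000, "+", ((10000, 10500), (19000, 20000)))
regions = [("chr1", 10200, 10201), ("chr1", 15000, 15001),
           ("chr1", 6000, 6001), ("chr1", 30000, 30001)]

for region, gene in translate_regions_to_genes(regions, [gene_a]):
    print(region, gene.name, genic_part(region[1], gene))
```

The region at position 30000 is more than 5,000 bp past the gene end, so it is not reported for this gene.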

Figure 8. MAP kinase pathway found to be significantly affected by mutations.

Enriched genes with mutations from 1000 Genomes VCF data

Differentially expressed genes from transcriptome experiment

Enriched genes from both experiments

Figure 7. Histogram plot showing a translated gene list of regions with a specific variant. The colors represent the genic part that contains a specific variant such as an insertion, deletion, and so forth.

Figure 7 legend. Genic part: Upstream, Intronic, Exonic, Downstream. X-axis: chromosome (chr1 to chr22 and chrX). Y-axis: counts.

The translated gene list can then be used as input to Gene Ontology analysis to identify each gene’s molecular function, biological process, or cellular localization.

To explore the underlying mechanism by which various DNA variants affect a biological process, GeneSpring GX offers an overlay of translated genes on pathways in single-omic as well as multi-omic analyses (Figure 8). The multi-omic analysis in the GeneSpring suite is described in detail elsewhere [1]. Users can query the list of genes against several pathway databases such as KEGG, BioCyc, and WikiPathways to identify statistically significant pathways that might be impacted by the variants identified in the study [2].
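The text does not say which statistic GeneSpring GX uses to call a pathway significant. A common choice for this kind of over-representation question is the one-sided hypergeometric test, sketched below in Python under that assumption. The function name and the placeholder gene identifiers (`g0` to `g9`) are hypothetical.

```python
from math import comb


def enrichment_p_value(background: set[str], pathway: set[str], hits: set[str]) -> float:
    """One-sided hypergeometric p-value for pathway over-representation.

    background: all genes that could have been selected
    pathway:    genes annotated to the pathway
    hits:       genes carrying variants (for example, a translated gene list)
    """
    pathway = pathway & background
    hits = hits & background
    n_total, n_pathway, n_hits = len(background), len(pathway), len(hits)
    n_overlap = len(pathway & hits)
    # P(X >= n_overlap) when n_hits genes are drawn without replacement.
    tail = sum(
        comb(n_pathway, i) * comb(n_total - n_pathway, n_hits - i)
        for i in range(n_overlap, min(n_pathway, n_hits) + 1)
    )
    return tail / comb(n_total, n_hits)


# Placeholder identifiers, not real genes.
background = {f"g{i}" for i in range(10)}
pathway = {"g0", "g1", "g2"}
hits = {"g0", "g1", "g2"}
print(enrichment_p_value(background, pathway, hits))  # 1/120, about 0.0083
```

When many pathways are tested at once, the resulting p-values would normally be corrected for multiple testing before any pathway is reported as significant.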


www.agilent.com/chem

For Research Use Only. Not for use in diagnostic procedures.

This information is subject to change without notice.

© Agilent Technologies, Inc., 2017 Published in the USA, September 25, 2017 5991-8301EN

Conclusion

Agilent GeneSpring GX software is a powerful exploratory tool for the identification, filtering, and curation of variants affecting a biological function. It offers high-resolution interactive browsing of reference genomes as well as different types of genomic annotations derived from a variety of public databases across complex datasets. The intuitive and easy-to-use pathway analysis utility allows merging variant data with proteomics and metabolomics in a multi-omic setting, as well as inter-genomic analysis.

References

1. Molecular Subtypes in Glioblastoma Multiforme: Integrated Analysis Using Agilent GeneSpring and Mass Profiler Professional Multi-Omics Software, Agilent Technologies, publication number 5991-5505EN.

2. Correlation Analysis in Agilent GeneSpring and Mass Profiler Professional, Agilent Technologies, publication number 5991-5165EN.