inrich (interval enrichment test)

INRICH (INterval enRICHment Test)

Phil H. Lee, PhD

2011-Oct-17

PI: Dr. Shaun Purcell
Collaboration Work with: C O’Dushlaine, B Thomas

Purcell Lab
Psychiatric & Neurodevelopmental Genetics Unit
Center for Human Genetics Research
Massachusetts General Hospital, Harvard Medical School
Stanley Center, The Eli and Edythe L. Broad Institute

[email protected]

ReadMe

This manual illustrates the usage of INRICH using a command line interface.

This manual assumes that:
• You have access to and are familiar with PLINK. Otherwise, please refer to the relevant sections on the PLINK homepage as cited in the main text.
• You have access to UNIX/LINUX/Cygwin with awk/gawk/shell for generating/manipulating input files. Text editors on Windows (e.g., notepad) can be used instead, but be cautious that column separators should be valid whitespace (e.g., space, tab).

Please adhere to the formatting requirements for each input file. If input files are incorrectly formatted, INRICH may not work properly and may not give you a proper warning.

I. Pathway Analysis

GOAL
Examine enriched association signals for pre-defined sets of genetic variants

ALGORITHM
A : set of LD-independent associated genomic intervals

For each gene set Si with Ni genes:
1. Count the number of associated intervals in A overlapping with genes in Si. Let’s denote the number Reali.
2. Generate R replicates, each with one random interval set selected, matching to A.
   Empirical gene-set p-value P1 = % of R replicates with at least Reali random intervals overlapping with genes in Si
3. Correct the empirical gene-set p-value using bootstrapping-based re-sampling.
   Corrected gene-set p-value P2 = % of B bootstrapping samples where the minimum gene-set p-value over all gene sets is at least as significant as P1

For more details, refer to Lee et al. 2011. Bioinformatics
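The counting and p-value steps above can be sketched in a few lines of Python. This is a minimal illustration, not the INRICH implementation: the function names and the toy data are hypothetical, and the replicate overlap counts and bootstrap minimum p-values are supplied directly rather than generated by matching random intervals to A.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start_bp, end_bp)


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two intervals share a chromosome and at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def count_overlapping_intervals(intervals: List[Interval], genes: List[Interval]) -> int:
    """Step 1: number of intervals that overlap at least one gene in the set."""
    return sum(any(overlaps(iv, g) for g in genes) for iv in intervals)


def empirical_p_value(real_count: int, replicate_counts: List[int]) -> float:
    """Step 2: fraction of replicates with at least real_count overlaps."""
    hits = sum(1 for c in replicate_counts if c >= real_count)
    return hits / len(replicate_counts)


def corrected_p_value(p1: float, bootstrap_min_p: List[float]) -> float:
    """Step 3: fraction of bootstrap samples whose minimum gene-set
    p-value is at least as significant as (not larger than) p1."""
    hits = sum(1 for m in bootstrap_min_p if m <= p1)
    return hits / len(bootstrap_min_p)


# Toy data (hypothetical): one gene set and three associated intervals.
gene_set = [("1", 100, 200), ("1", 500, 600), ("2", 50, 80)]
associated = [("1", 150, 250), ("2", 60, 70), ("3", 10, 20)]
real = count_overlapping_intervals(associated, gene_set)

# Hypothetical overlap counts from R = 5 random interval sets.
p1 = empirical_p_value(real, [0, 1, 2, 0, 3])

# Hypothetical minimum gene-set p-values from B = 4 bootstrap samples.
p2 = corrected_p_value(p1, [0.6, 0.2, 0.8, 0.1])

print(real, p1, p2)
```

Here the p-values are plain fractions, as in the definitions above; no pseudo-count is added.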

I. Pathway Analysis

INPUT
-g reference_gene_file
-m reference_SNP_file
-t target_gene_set_file
-a associated_interval_file

OUTPUT
Two enrichment P-values for target gene sets (empirical/corrected)
One global enrichment P-value (number of unique genes in nominally associated sets)

I. (1) Input File
Reference Gene File

List of reference genes for a target organism

Col 1: Chromosome
Col 2: Gene Start Base Pair Position
Col 3: Gene End Base Pair Position
Col 4: Gene ID * should be unique
Col 5: Gene Symbol
Col 6 and the following ones: Annotation (optional)

Generate your own reference gene file or use the preformatted Entrez gene file downloadable at the INRICH homepage.
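Because INRICH may not warn about malformed input, it can help to check a self-made reference gene file before running. The following is a minimal sketch, assuming whitespace-separated columns as described above; the function name and the example rows are hypothetical.

```python
def parse_reference_genes(lines):
    """Parse reference gene rows and enforce the column rules listed above.

    Returns a dict: gene_id -> (chromosome, start_bp, end_bp, symbol).
    Raises ValueError on short rows, inverted coordinates, or duplicate IDs.
    """
    genes = {}
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) < 5:
            raise ValueError(f"line {line_no}: expected at least 5 columns")
        chrom, start, end, gene_id, symbol = fields[:5]
        start, end = int(start), int(end)
        if start > end:
            raise ValueError(f"line {line_no}: start position exceeds end position")
        if gene_id in genes:
            raise ValueError(f"line {line_no}: duplicate gene ID {gene_id}")
        genes[gene_id] = (chrom, start, end, symbol)
    return genes


# Hypothetical rows; columns 6+ are free-text annotation.
example_genes = [
    "1 1000 5000 101 GENEA example annotation",
    "1 8000 9500 102 GENEB",
]
genes = parse_reference_genes(example_genes)
print(genes)
```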

I. (1) Input File
Reference SNP File

List of reference SNPs examined in the association study

Col 1: Chromosome
Col 2: SNP Base Pair Position

Reference SNP File Generation Example
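One way to produce this two-column file is to extract columns from a PLINK marker file. The sketch below assumes the standard PLINK .bim (or .map) layout, in which column 1 is the chromosome and column 4 is the base pair position; the function name and the example rows are hypothetical.

```python
def bim_to_reference_snps(bim_lines):
    """Return 'chromosome position' rows from PLINK .bim/.map style lines."""
    rows = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, bp = fields[0], fields[3]
        rows.append(f"{chrom} {bp}")
    return rows


# Hypothetical .bim rows: chromosome, SNP ID, genetic distance, position, alleles.
example_bim = [
    "1 rs0001 0 1000 A G",
    "1 rs0002 0 2500 C T",
]
reference_snps = bim_to_reference_snps(example_bim)
print("\n".join(reference_snps))
```

The same selection of columns 1 and 4 can be done with a one-line awk command if awk is preferred.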

I. (1) Input File
Target Gene Set File

List of target gene sets for enrichment tests

Col 1: Gene ID * This ID should match a Gene ID in the reference gene file.
Col 2: Target Gene Set ID
Col 3 and the following ones: Annotation

Generate your own gene set file or use the preformatted gene set files downloadable at the INRICH homepage.

If there are public gene sets of interest that are not available at our homepage, let us know.
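Since gene IDs in the gene set file must match those in the reference gene file, a quick consistency check can catch mismatches early. This is a minimal sketch with hypothetical names and data; it groups genes by set ID and reports IDs that are absent from the reference.

```python
from collections import defaultdict


def load_gene_sets(set_lines, reference_gene_ids):
    """Group gene IDs by gene set ID and collect IDs missing from the reference.

    Returns (gene_sets, unmatched) where gene_sets maps set_id -> set of gene IDs
    and unmatched is a list of (gene_id, set_id) pairs not found in the reference.
    """
    gene_sets = defaultdict(set)
    unmatched = []
    for line in set_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        gene_id, set_id = fields[0], fields[1]
        if gene_id in reference_gene_ids:
            gene_sets[set_id].add(gene_id)
        else:
            unmatched.append((gene_id, set_id))
    return dict(gene_sets), unmatched


# Hypothetical reference IDs and gene set rows.
reference_ids = {"101", "102", "103"}
example_sets = [
    "101 SET_A example pathway",
    "102 SET_A example pathway",
    "999 SET_B another pathway",
    "103 SET_B another pathway",
]
gene_sets, unmatched = load_gene_sets(example_sets, reference_ids)
print(gene_sets, unmatched)
```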

I. (1) Input File
Associated Interval File

List of LD-independent associated interval regions for enrichment tests

Col 1: Chromosome
Col 2: Interval Start Base Pair Position
Col 3: Interval End Base Pair Position

There are multiple ways to generate LD-independent associated interval regions. On the next pages, we illustrate two such cases with SNP GWAS data.

I. (1) Input File
Associated Interval File

Case I. Use PLINK LD clumping to generate LD-independent genomic intervals (http://pngu.mgh.harvard.edu/~purcell/plink/clump.shtml)

1. Run PLINK LD clumping.
2. Generate an interval file from the clumped range file (adjust the optional clumping parameters accordingly if generated intervals are too big or too many!).

Example
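Step 2 can be sketched as follows. The code assumes that each data row of the clumped range file contains a position field of the form chrN:start..end, which is how PLINK range reports are commonly laid out; check this against your own file before relying on it. The function name and the sample rows are hypothetical.

```python
import re

# Matches a range field such as "chr1:1000..5000" (assumed layout).
RANGE_FIELD = re.compile(r"^chr([0-9A-Za-z]+):(\d+)\.\.(\d+)$")


def ranges_to_intervals(lines):
    """Extract (chromosome, start_bp, end_bp) from each row that has a range field."""
    intervals = []
    for line in lines:
        for field in line.split():
            match = RANGE_FIELD.match(field)
            if match:
                chrom, start, end = match.groups()
                intervals.append((chrom, int(start), int(end)))
                break  # one interval per row
    return intervals


# Hypothetical clumped range rows, including a header line that is skipped
# because it has no range field.
example_ranges = [
    " CHR     SNP        P   N              POS      KB RANGES",
    "   1  rs0001  1.2e-06   4  chr1:1000..5000   4.001 [GENEA]",
    "   2  rs0002  3.4e-05   2  chr2:7000..7600   0.601 []",
]
intervals = ranges_to_intervals(example_ranges)
for chrom, start, end in intervals:
    print(chrom, start, end)  # three whitespace-separated columns
```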

I. (1) Input File
Associated Interval File

Case II. Use Tag SNP information to generate LD-independent genomic intervals (http://pngu.mgh.harvard.edu/~purcell/plink/clump.shtml)

1. Run PLINK SNP tagging.
2. Generate an interval file from the tag info file (adjust the optional tagging parameters accordingly if generated intervals are too big or too many!).

Example

I. (2) Running Enrichment Tests

The four input files summarized in the previous slides are required to run INRICH. Optional parameters, if not specified, take their default values. Refer to the INRICH homepage for the most up-to-date parameter list.

The following example illustrates the use of some optional parameters.

Example:
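As a minimal sketch, the call with the four required inputs could be assembled like this. Only the -g, -m, -t, and -a flags come from this manual; the executable name `inrich` and all file names are assumptions for illustration, and optional parameters are left out because their names should be taken from the INRICH homepage.

```python
import shlex


def build_inrich_command(gene_file, snp_file, set_file, interval_file,
                         executable="inrich"):
    """Assemble an INRICH command line from the four required input files.

    The executable name is an assumption; adjust it to your installation.
    """
    return [
        executable,
        "-g", gene_file,       # reference gene file
        "-m", snp_file,        # reference SNP file
        "-t", set_file,        # target gene set file
        "-a", interval_file,   # associated interval file
    ]


# Hypothetical file names.
command = build_inrich_command("ref_genes.txt", "ref_snps.txt",
                               "gene_sets.txt", "intervals.txt")
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the executable is on the search path.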

I. (3) Output List

Gene set list with empirical p-value < 0.05*

Col 1: Number of genes in gene set
Col 2: Number of associated intervals overlapping with genes in gene set
Col 3: Empirical enrichment p-value for gene set
Col 4: Corrected enrichment p-value for gene set
Col 5: Gene set ID
Col 6 and the following ones: Gene set annotation

Note that the number of genes in a gene set may differ from the number in the gene set file because: i) only reference genes were considered; and ii) overlapping genes were merged.

* The p-value threshold can be adjusted using the optional argument -p.

I. (3) Output List

Global enrichment p-value

I. Are we observing more unique genes in the enriched pathways than would be expected by chance?

Col 1: P-value threshold to select significantly enriched pathways
Col 2: # of unique genes in the enriched pathways with empirical p-value ≤ P-value threshold
Col 3: % of bootstrapping samples where # of unique genes in enriched pathways is not less than that of original association data

* Note that the first global enrichment p-value may be biased if enriched pathways were selected mostly by the same genes (i.e., most are nested and/or related pathways). In such cases, the second global enrichment p-value may be a more reasonable metric to use.

II. Positional Clustering

GOAL
To identify genomic regions with non-randomly clustered associated intervals

ALGORITHM
A : set of LD-independent associated genomic intervals

1. Identify the top N closest associated intervals.
2. Generate R replicates, each with one random interval set selected, matching to A.
3. Calculate the empirical positional clustering p-value P1 for the k-th closest interval pair as % of R replicates where the distance between the k-th closest random interval pair is not larger than that of the original k-th pair.
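The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the INRICH implementation: distance is taken as the base-pair gap between two intervals on the same chromosome (zero if they overlap), and the k-th closest distances from the random replicates are supplied directly. All names and numbers are hypothetical.

```python
from itertools import combinations


def gap(a, b):
    """Base-pair gap between two same-chromosome intervals (0 if they overlap)."""
    return max(0, max(a[1], b[1]) - min(a[2], b[2]))


def closest_pair_distances(intervals, top_n):
    """Step 1: sorted distances of the top_n closest same-chromosome pairs."""
    distances = sorted(
        gap(a, b) for a, b in combinations(intervals, 2) if a[0] == b[0]
    )
    return distances[:top_n]


def clustering_p_value(observed_kth, replicate_kth):
    """Step 3: fraction of replicates whose k-th closest distance is not
    larger than the observed k-th closest distance."""
    hits = sum(1 for d in replicate_kth if d <= observed_kth)
    return hits / len(replicate_kth)


# Hypothetical associated intervals: (chromosome, start_bp, end_bp).
associated = [("1", 100, 200), ("1", 250, 300), ("1", 1000, 1100), ("2", 10, 20)]
top = closest_pair_distances(associated, top_n=2)

# Hypothetical closest-pair (k = 1) distances from R = 5 random interval sets.
p_first = clustering_p_value(top[0], [40, 120, 300, 50, 900])
print(top, p_first)
```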

II. Positional Clustering

INPUT
-a associated_interval_file
-g reference_gene_file
-m reference_SNP_file
-t target_gene_set_file
-h number_of_top_closest_interval_pairs_to_examine

OUTPUT
Clustering significance P-values for top n closest associated intervals (empirical)

II. (1) Running Positional Clustering

The positional clustering test is run together with the enrichment tests.

Example

Examine top 100 closest associated intervals

II. (2) Output

Top N closest pairs of associated intervals

Col 1: Closest Pair Rank
Col 2: Closest Pair Interval 1
Col 3: Closest Pair Interval 2
Col 4: Distance between Interval 1 and 2
Col 5: Clustering Significance P-value
Col 6: Interval 1 Chromosomal Position
Col 7: Interval 2 Chromosomal Position

Associated intervals are not non-randomly clustered on a certain genomic locus, based on the clustering significance p-values.

III. Using GUI to load Analysis Output

Output files from the INRICH command line interface can be loaded into the GUI version by selecting the files in the Open File menu.

References

INRICH: Lee et al. 2011 Bioinformatics (to be submitted). INRICH: Interval-based enrichment analysis for genome wide association studies. http://atgu.mgh.harvard.edu/inrich/

ALIGATOR: Holmans et al. 2009 Am J Hum Genet 85(1):13-24. Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. http://x004.psycm.uwcm.ac.uk/~peter/

PLINK: http://pngu.mgh.harvard.edu/~purcell/plink/

Purcell Lab: http://pngu.mgh.harvard.edu/~purcell