Investigate Variation of Chromatin Interactions in Human Tissues
Hiren Karathia, PhD; Sridhar Hannenhalli, PhD; Michelle Girvan, PhD
Introduction of Hi-C experiment
In-silico Analysis
Objectives
• Develop a general pipeline for Hi-C data processing
• Detect gene-centric Hi-C interactions across different cell types
• Differentiate ubiquitous versus tissue specific gene-gene interactions
• Quantify spatial proximity of genes in pathways and quantify pathway proximity across multiple cell lines
• Investigate correlation between pathway proximity and pathway activity (approximated by expression of pathway genes)
• MORE….
Summary
• Hind-III fragment distribution of the human genome (Slide number: 7 & 8) - These slides show the number of in-silico Hind-III fragments (recognition site AAGCTT) per chromosome in the human genome. Downstream Hi-C analyses are based on these fragments.
• Hi-C data processing (Slide number: 9 & 10) - List of samples processed. The crucial steps are filtration and normalization of the Hi-C interactions.
– Filtration: Removal of technical biases from the Hi-C data. These biases include GC content, ligation preferences (self-ligations), and unequal tag densities.
– Normalization: The background is modeled as the expected number of Hi-C reads between two given regions, assuming that interaction probability decreases with increasing distance between the two regions.
– Selection of significant interactions: Interactions are selected by comparing the observed number of reads with the expected number of reads (odds ratio) at significance cut-offs of P = 0.001 and P = 0.05.
• Annotation of significant Hi-C interactions (Slide number: 11) - Significant Hi-C interactions are annotated with hg19 genomic features (gene structures, promoters, intergenic and non-coding regions).
• Non-redundant genes in Hi-C interactions (Slide number: 12) - All annotated genes and promoters involved in a significant Hi-C interaction are selected. The slide shows the numbers of genes and promoters in each replicate of every tissue.
• Non-redundant Hi-C interactions across the tissues and replicates (Slide number: 13) - Hi-C interactions whose end points map to different genomic features, in either replicate of each tissue.
• Inter-tissue comparison of Hi-C interactions (Slide number: 14) - Gene-gene Hi-C interactions from the replicates of each tissue were merged, then searched for interactions unique to a single tissue and those shared by a pair of tissues (Figure A). Figure B shows the number of gene-gene interactions found in a given number of tissues.
• KEGG pathways analysis (Slide number: 15) - KEGG pathways with fewer than 5 annotated genes were excluded. Edge fraction was used to quantify the spatial proximity of the genes in a pathway.
• Z-score distribution of the KEGG pathways (Slide number: 16 & 17) - Edge fraction, and its z-score based on 500 length-controlled random gene sets, was calculated for all pathways in all cell types.
• Inter-tissue comparison of pathway proximity (Slide number: 18) - Unique and shared pathways with spatial proximity are shown for two z-score thresholds.
• Heat-map for the pathways Hi-C analysis (Slide number: 19) - The heat-map shows z-scores of all the pathways in 6 tissues. The pathways are clustered by the Manhattan distance between their z-score vectors.
Future Work
• Finding Hi-C interactions at lower stringency: Because read coverage is low in a few tissues, very few significant interactions are detected there. We will repeat the analyses with a less stringent interaction significance cutoff (updated slide number 10).
• Processing RNA-Seq: Matched RNA-Seq data are available for 4 tissues. We will test the hypothesis that spatial proximity of pathways correlates with expression of pathway genes.
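The planned proximity-versus-expression test could be sketched as a rank correlation. This is only an illustration under stated assumptions: the slides do not name a statistic, so Spearman correlation is a stand-in, and the pathway z-scores and expression values below are made-up toy numbers.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy values (hypothetical): one pathway's proximity z-score and mean
# expression of its genes in each of 4 tissues with matched RNA-Seq.
proximity_z = [2.1, 0.4, 1.3, -0.2]
mean_expression = [35.0, 8.0, 20.0, 5.0]
rho = spearman(proximity_z, mean_expression)
```

A value of rho near 1 in such a test would be consistent with the hypothesis; the toy numbers carry no biological meaning.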
Hind-III RE Sites on Annotated hg19 Genome
[Figure: Frequency of Hind-III fragments per chromosome]
Chromosome: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M
Frequency of Hind-III fragments: 64395 71483 59636 58925 55089 51311 45517 43187 34643 38016 39326 38987 29412 25752 22933 19903 18939 22771 11542 15705 10089 7582 45074 7430 4
[Figure: Chromosome length / number of Hind-III fragments, per chromosome (1-22, X, Y, M); values not recoverable]
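The in-silico digestion behind these fragment counts can be sketched as follows. This is a minimal illustration, not the authors' script: the function names and the toy sequence are hypothetical, and the only facts relied on are the AAGCTT recognition site stated in the summary and that Hind-III cuts after the first A (A^AGCTT).

```python
HINDIII_SITE = "AAGCTT"  # recognition sequence named in the summary slide

def hindiii_cut_positions(sequence, site=HINDIII_SITE, cut_offset=1):
    """Return 0-based cut positions; Hind-III cuts one base into the site (A^AGCTT)."""
    seq = sequence.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + cut_offset)
        start = seq.find(site, start + 1)
    return positions

def digest(sequence, site=HINDIII_SITE, cut_offset=1):
    """Split a linear sequence into restriction fragments at every cut position."""
    cuts = hindiii_cut_positions(sequence, site, cut_offset)
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]

# Toy sequence (hypothetical) with two Hind-III sites.
toy = "GGGAAGCTTCCCCAAGCTTTT"
fragments = digest(toy)
fragment_count = len(fragments)  # two sites in a linear sequence give three fragments
```

Applied chromosome by chromosome to a reference FASTA, a count like `fragment_count` would correspond to the per-chromosome frequencies shown on the slide.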
Distribution of RE sites in cell line sample
No. of Hind-III fragments per chromosome (Reference, hESC-1, hESC-2)
Chromosome  Reference  hESC-1  hESC-2
1               64395   62899   62988
2               71483   70355   70476
3               59636   59015   59123
4               58925   58207   58369
5               55089   54124   54195
6               51311   50729   50804
7               45517   44519   44571
8               43187   42537   42609
9               34643   32456   32469
10              38016   37072   37161
11              39326   38765   38783
12              38987   38661   38673
13              29412   29160   29178
14              25752   25446   25459
15              22933   22125   22224
16              19903   19325   19349
17              18939   18567   18575
18              22771   22548   22594
19              11542   11381   11396
20              15705   15420   15538
21              10089    9970    9983
22               7582    7371    7357
X               45074   43571   43512
Y                7430     699    4986
M                   4       4       4
Hi-C data processing
[Flow chart. Labels as extracted: Sample Fastq files; BWA; Samtools; BAM file; Merged BAM file; Samtools; Samtools; Sorted BAM file; De-duplicated file; Picard tool; Separate Hi-C interacting reads; Samtools; SAM file; Select significant interactions; HOMER tools; Annotate the interactions; Normalize Hi-C reads; Pathways analysis; Gene-centric analysis; In-house Python scripts; HOMER tools]
Samples processed (the DNA and RNA columns of the slide table carried no recoverable values):
Tissue ID  Tissue Source
HEK293     Kidney Cell Line (Replicate 1 & 2)
hESC       Embryonic Stem Cell Line (Replicate 1 & 2)
IMR90      Lung Fibroblast Cell Line (Replicate 1 & 2)
BT483      Mammary Gland Cell Line (Replicate 1 & 2)
GM06990    B-Lymphocyte Cell Line (Replicate 1 & 2)
RWPE1      Prostate Epithelial Cell Line (Replicate 1 & 2)
Normalization & Filtration of Hi-C interactions
N = estimated total number of reads
n = estimated total number of interaction reads at each region
f = expected frequency of Hi-C reads as a function of distance
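The formula these symbols belong to did not survive extraction. The definitions match the background model described in the HOMER Hi-C documentation, so, assuming the slide showed that model, the expected number of reads between regions i and j would be written as below. The subscripts and the distance symbol d_ij are notation added here for clarity.

```latex
e_{ij} = f(d_{ij}) \, \frac{n_i \, n_j}{N}
```

As a worked example with made-up numbers: for n_i = 400, n_j = 500, N = 1,000,000 and f(d_ij) = 10, the expected count is 10 × 400 × 500 / 1,000,000 = 2 reads.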
Select Significant Intra/Inter chromosomal interactions
Random interactions
Interactions after the normalization and filtration process
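The selection step can be illustrated with a small sketch that compares observed reads against the distance-based expectation. This is not HOMER's implementation: the Poisson upper-tail p-value is a simplifying stand-in chosen so the example needs only the standard library, and the field names and toy counts are hypothetical. Only the cut-off of 0.001 comes from the slides.

```python
import math

def expected_reads(n_i, n_j, total_reads, f_distance):
    """Expected reads between two regions: f(distance) * n_i * n_j / N."""
    return f_distance * n_i * n_j / total_reads

def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected); a stand-in significance test."""
    if observed <= 0:
        return 1.0
    term = math.exp(-expected)  # P(X = 0)
    cdf = term
    for k in range(1, observed):
        term *= expected / k    # P(X = k) from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def select_significant(candidates, total_reads, p_cutoff=0.001):
    """Keep interactions whose observed count is unlikely under the background."""
    kept = []
    for c in candidates:
        exp = expected_reads(c["n_i"], c["n_j"], total_reads, c["f"])
        if exp <= 0:
            continue  # no usable background estimate for this pair
        p = poisson_upper_tail(c["observed"], exp)
        if p <= p_cutoff:
            kept.append({**c, "expected": exp,
                         "ratio": c["observed"] / exp, "p_value": p})
    return kept

# Toy candidates (hypothetical region names and counts).
toy = [
    {"pair": ("regionA", "regionB"), "observed": 12, "n_i": 400, "n_j": 500, "f": 10.0},
    {"pair": ("regionA", "regionC"), "observed": 2, "n_i": 400, "n_j": 300, "f": 10.0},
]
significant = select_significant(toy, total_reads=1_000_000)
```

With these toy numbers only the first pair passes the 0.001 cut-off; rerunning with `p_cutoff=0.05` mirrors the second threshold mentioned in the summary.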
Annotate the Interactions
Annotation of Hi-C interactions on Genomic Structures (example: HEK293)
[Figure: number of Hi-C interaction end points per genomic feature, by chromosome]
Chromosome  3' UTR  5' UTR  Intergenic   TTS  exon  intron  non-coding  promoter-TSS
chr1           236      22        9600   269   320    8961          62           312
chr2           134      13        7780   148   176    5978          52           155
chr3           172      16        8279   175   206    7014          44           176
chr4           103      10        8473    93   130    4877          23           105
chr5           148       8        7441   115   118    4922          35           143
chr6           149      23        9382   174   199    6225          64           198
chr7            98       6        5430   104   130    5081          33           108
chr8            76      13        4955    79    79    3645          34            77
chr9            70      11        4525    84    90    3696          32           124
chr10          103      10        4625   110   132    4828          40           106
chr11          117      10        3410    96   116    3341          35           145
chr12          131      19        5915   134   160    4821          30           156
chr13           62       9        7964    71    88    4356          42            96
chr14           93      14        3792   102   124    3423          23            99
chr15           78      12        3389    77   118    3755          28           101
chr16           50       8        1658    60    90    1729          20            81
chr17          112      13        2548   137   146    3368          36           166
chr18           49       2        2630    31    43    1841           6            34
chr19           40       3         838    63    62     793          13            60
chr20           44      11        1787    43    64    1918          24            55
chr21           41       5        1754    32    27    1296          10            37
chr22           55       6        1141    64    79    1714          18            57
chrX            43       3        2350    27    52    1408           6            33
chrY             0       0          23     0     0       2           1             0
Genes & Promoters in Hi-C interactions
Genomic features on end points of Hi-C interactions
Inter-tissue Hi-C gene-gene interactions
Figure A: Diagonal values represent interactions unique to a tissue. Other values represent interactions shared between two specific tissues.
Tissues  HEK293  IMR90   hESC  BT483  GM06990  RWPE1
HEK293    19609
IMR90     23015  20608
hESC      17683  16965   6685
BT483     11371  11961  10498   3503
GM06990    8846   9036   9879   8958     1419
RWPE1      6555   6931   5771   4865     4025   1440
[Figure B: plot, values not recoverable]
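A matrix of this shape can be derived from per-tissue sets of gene pairs, as in the sketch below. Function names and gene names are hypothetical, and one reading of the caption is assumed: an off-diagonal cell counts every interaction present in both tissues, whether or not other tissues also have it.

```python
from collections import Counter
from itertools import combinations

def canonical(pair):
    """Order the two gene names so (A, B) and (B, A) are the same interaction."""
    a, b = pair
    return (a, b) if a <= b else (b, a)

def sharing_matrix(interactions_by_tissue):
    """Diagonal: interactions unique to a tissue. Off-diagonal: shared by the pair."""
    sets = {t: {canonical(p) for p in pairs}
            for t, pairs in interactions_by_tissue.items()}
    matrix = {}
    for tissue, own in sets.items():
        others = set().union(*(s for t, s in sets.items() if t != tissue))
        matrix[(tissue, tissue)] = len(own - others)
    for a, b in combinations(sets, 2):
        matrix[(a, b)] = len(sets[a] & sets[b])
    return matrix

def tissue_count_histogram(interactions_by_tissue):
    """How many interactions are found in exactly k tissues (as in Figure B)."""
    seen = Counter()
    for pairs in interactions_by_tissue.values():
        seen.update({canonical(p) for p in pairs})
    return Counter(seen.values())

# Toy input (hypothetical gene names), replicates already merged per tissue.
toy = {
    "HEK293": [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D")],
    "IMR90": [("GENE_B", "GENE_A"), ("GENE_E", "GENE_F")],
    "hESC": [("GENE_A", "GENE_B")],
}
matrix = sharing_matrix(toy)
histogram = tissue_count_histogram(toy)
```

Canonicalizing each pair first keeps an interaction reported as (A, B) in one tissue and (B, A) in another from being counted as two different interactions.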
Pathways analysis
Evaluate Edge-fraction property for its statistical correlation with spatial proximity
E(f) = set of observed gene-gene interactions in a pathway
Ea(f) = set of all possible gene-gene interactions among the genes in a pathway
Z-score of the edge fraction is calculated against randomly selected, length-controlled gene sets
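One way to read these definitions is edge fraction = |E(f)| / |Ea(f)|, scored against random gene sets. The sketch below follows that reading. The length matching by fixed-width bins, the bin width, the function names, and the toy genes are all assumptions; only the 500 random sets come from the summary.

```python
import random
import statistics
from itertools import combinations

def edge_fraction(genes, interactions):
    """Observed gene-gene interactions divided by all possible gene pairs."""
    genes = sorted(set(genes))
    possible = len(genes) * (len(genes) - 1) // 2
    if possible == 0:
        return 0.0
    observed = sum(1 for a, b in combinations(genes, 2)
                   if frozenset((a, b)) in interactions)
    return observed / possible

def random_length_matched_set(pathway_genes, gene_lengths, rng, bin_width=10_000):
    """Draw one random gene per pathway gene from the same length bin (assumed scheme)."""
    by_bin = {}
    for gene, length in gene_lengths.items():
        by_bin.setdefault(length // bin_width, []).append(gene)
    chosen = set()
    for gene in pathway_genes:
        pool = [g for g in by_bin[gene_lengths[gene] // bin_width] if g not in chosen]
        chosen.add(rng.choice(pool))
    return chosen

def edge_fraction_zscore(pathway_genes, interactions, gene_lengths, n_random=500, seed=0):
    """Return (edge fraction, z-score against n_random length-matched gene sets)."""
    rng = random.Random(seed)
    observed = edge_fraction(pathway_genes, interactions)
    null = [edge_fraction(random_length_matched_set(pathway_genes, gene_lengths, rng),
                          interactions)
            for _ in range(n_random)]
    sd = statistics.pstdev(null)
    if sd == 0:
        return observed, float("nan")
    return observed, (observed - statistics.mean(null)) / sd

# Toy data (hypothetical): 12 genes of similar length, a 5-gene pathway
# whose genes all interact with each other and with nothing else.
gene_lengths = {f"G{i}": 5_000 + i for i in range(12)}
pathway = [f"G{i}" for i in range(5)]
interactions = {frozenset(p) for p in combinations(pathway, 2)}
fraction, z = edge_fraction_zscore(pathway, interactions, gene_lengths)
```

The toy pathway has 5 genes, in line with the stated exclusion of pathways with fewer than 5 annotated genes.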
Pathways analysis for Gene-Gene Interactions
[Z-score distribution panels, labeled: 49456 interactions; 14524 interactions; 30356 interactions]
Pathways analysis for Gene-Gene Interactions
[Z-score distribution panels, labeled: 20018 interactions; 50841 interactions; 10088 interactions]
Inter-tissue Hi-C pathways interactions
Z-score >= 1
Tissues  HEK293  IMR90   hESC  BT483  GM06990  RWPE1
HEK293        2
IMR90       112      1
hESC        110    108      1
BT483       107    105    103      2
GM06990     109    109    114    104        1
RWPE1        72     73     73     75       73      0
Z-score >= 2
Tissues  HEK293  IMR90   hESC  BT483  GM06990  RWPE1
HEK293        3
IMR90        76      4
hESC         71     72      7
BT483        61     64     60      0
GM06990      73     72     75     62        1
RWPE1        30     32     32     29       32      3
Heat-map for the Pathways Hi-C analysis
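The distance used for the heat-map clustering can be sketched as below, assuming each pathway is represented by a vector of 6 z-scores, one per tissue. The pathway names and values are made-up, and the clustering step itself is not shown.

```python
def manhattan(u, v):
    """Manhattan (city-block) distance between two equal-length z-score vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))

def distance_matrix(zscores):
    """Pairwise Manhattan distances between pathways, keyed by (pathway, pathway)."""
    names = sorted(zscores)
    return {(p, q): manhattan(zscores[p], zscores[q]) for p in names for q in names}

# Toy z-score vectors (hypothetical), one value per tissue, 6 tissues.
zscores = {
    "pathway_1": [2.0, 1.5, 0.0, -0.5, 1.0, 0.0],
    "pathway_2": [1.5, 1.5, 0.5, -0.5, 1.0, 0.5],
    "pathway_3": [-1.0, -1.0, 2.0, 2.0, -1.0, 2.0],
}
distances = distance_matrix(zscores)
```

A hierarchical clustering routine would then group pathways with small pairwise distances; SciPy's `linkage`, for instance, accepts `metric="cityblock"` for the same distance.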