covprofile: profiling the viral genome and gene ... · covprofile: profiling the viral genome and...

13
CovProfile: profiling the viral genome and gene expressions of SARS-COV2 Yonghan Yu 1,, Zhengtu Li 2,, Yinhu Li 1,, Le Yu 3,, Wenlong Jia 1 , Feng Ye 2,* Shuai Cheng Li 1,* 1 City University of Hong Kong Shenzhen Research Institute 2 State Key Laboratory of Respiratory Disease, National Clinical Research Center for Respiratory Disease, Guangzhou Institute of Respiratory Health, the First Affiliated Hospital of Guangzhou Medical University, Guangzhou 510120, China 3 Beijing YuanShengKangTai (ProtoDNA) Genetech Co. Ltd., Beijing 100190, China The authors contributed equally to the work. *[email protected], *[email protected] Abstract SARS-COV2 is infecting more than one million people. Knowing its genome and gene expressions is essential to understand the virus. Here, we propose a computational tool to detect the viral genomic variations as well as viral gene expressions from the sequenced amplicons with Nanopore devices of the multiplex reverse transcription polymerase chain reactions. We applied our tool to 11 samples of terminally ill patients, and discovered that seven of the samples are infected by two viral strains. The expression of viral genes ORF1ab gene, S gene, M gene, and N gene are high among most of the samples. While performing our tests, we noticed the abundances of MUC5B transcript segments to be consistently higher than those of other genes in the host. Introduction 1 Severe acute respiratory syndrome coronavirus 2 (SARS-COV-2) has infected more than one million 2 people, and has led to more than 80,000 deaths, at the time of preparation of this 3 manuscript [1, 23, 25, 28]. The virus is an enveloped and single-stranded RNA betacoronavirus 30kb 4 base-pairs (bps), , which belongs to the family Coronaviridae [6,27]. Since the year 2000, we have 5 witnessed and experienced three highly widespread pathogenic coronaviruses in human populations — 6 the other two viruses are SARS-COV in 2002-2003, and MERS-CoV in 2012 [4]. All three viruses can 7 lead to Acute Respiratory Distress Syndrome (ARDS) in the human host, which may cause pulmonary 8 fibrosis,and lead to permanent lung function reduction or death [2, 20, 22]. The mechanisms of these 9 severe cases remain unclear. 10 No drug is currently confirmed to be effective against SARS-COV-2. Some vaccine candidates are 11 under intensive studies [12]. A nucleotide analog RNA-dependent RNA Polymerase (RdRP) inhibitor 12 called remdesivir has demonstrated effects on SARS-COV-2 [19]. Hoffmann et al. suggested that 13 TMPRSS2 inhibitor approved for clinical use can be adopted as a treatment option [9]. In silico methods 14 have as well produced lists of potential drugs [7] and epitopes [16] worthy of further exploration. 15 However, short of details on how the viral molecule compromises the cells of hosts, it is hard to prioritize 16 our exploration of the huge array of potential drugs and therapies. Analysis of the mechanism can 17 provide us insights on the viral transmission as well as revealing therapeutic targets. 18 However, there are many difficulties in analyzing the virus. Isolating the virus can be tedious and 19 time-consuming. The samples usually carry low viral load, resulting in false negatives. To reduce the 20 1/13 . CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted April 6, 2020. . https://doi.org/10.1101/2020.04.05.026146 doi: bioRxiv preprint

Upload: others

Post on 02-Jun-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CovProfile: profiling the viral genome and gene ... · CovProfile: profiling the viral genome and gene expressions of SARS-COV2 Yonghan Yu 1, y, Zhengtu Li2,, Yinhu Li , Le Yu3,,

CovProfile: profiling the viral genome and gene expressions ofSARS-COV2

Yonghan Yu1,†, Zhengtu Li2,†, Yinhu Li1,†, Le Yu3,†, Wenlong Jia1, Feng Ye2,* Shuai Cheng Li1,*

1 City University of Hong Kong Shenzhen Research Institute2 State Key Laboratory of Respiratory Disease, National Clinical Research Center forRespiratory Disease, Guangzhou Institute of Respiratory Health, the First AffiliatedHospital of Guangzhou Medical University, Guangzhou 510120, China3 Beijing YuanShengKangTai (ProtoDNA) Genetech Co. Ltd., Beijing 100190, China† The authors contributed equally to the work.

*[email protected], *[email protected]

Abstract

SARS-COV2 is infecting more than one million people. Knowing its genome and gene expressions isessential to understand the virus. Here, we propose a computational tool to detect the viral genomicvariations as well as viral gene expressions from the sequenced amplicons with Nanopore devices of themultiplex reverse transcription polymerase chain reactions. We applied our tool to 11 samples ofterminally ill patients, and discovered that seven of the samples are infected by two viral strains. Theexpression of viral genes ORF1ab gene, S gene, M gene, and N gene are high among most of the samples.While performing our tests, we noticed the abundances of MUC5B transcript segments to be consistentlyhigher than those of other genes in the host.

Introduction 1

Severe acute respiratory syndrome coronavirus 2 (SARS-COV-2) has infected more than one million 2

people, and has led to more than 80,000 deaths, at the time of preparation of this 3

manuscript [1, 23,25,28]. The virus is an enveloped and single-stranded RNA betacoronavirus 30kb 4

base-pairs (bps), , which belongs to the family Coronaviridae [6, 27]. Since the year 2000, we have 5

witnessed and experienced three highly widespread pathogenic coronaviruses in human populations — 6

the other two viruses are SARS-COV in 2002-2003, and MERS-CoV in 2012 [4]. All three viruses can 7

lead to Acute Respiratory Distress Syndrome (ARDS) in the human host, which may cause pulmonary 8

fibrosis,and lead to permanent lung function reduction or death [2, 20,22]. The mechanisms of these 9

severe cases remain unclear. 10

No drug is currently confirmed to be effective against SARS-COV-2. Some vaccine candidates are 11

under intensive studies [12]. A nucleotide analog RNA-dependent RNA Polymerase (RdRP) inhibitor 12

called remdesivir has demonstrated effects on SARS-COV-2 [19]. Hoffmann et al. suggested that 13

TMPRSS2 inhibitor approved for clinical use can be adopted as a treatment option [9]. In silico methods 14

have as well produced lists of potential drugs [7] and epitopes [16] worthy of further exploration. 15

However, short of details on how the viral molecule compromises the cells of hosts, it is hard to prioritize 16

our exploration of the huge array of potential drugs and therapies. Analysis of the mechanism can 17

provide us insights on the viral transmission as well as revealing therapeutic targets. 18

However, there are many difficulties in analyzing the virus. Isolating the virus can be tedious and 19

time-consuming. The samples usually carry low viral load, resulting in false negatives. To reduce the 20

1/13

.CC-BY 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted April 6, 2020. . https://doi.org/10.1101/2020.04.05.026146doi: bioRxiv preprint

Page 2: CovProfile: profiling the viral genome and gene ... · CovProfile: profiling the viral genome and gene expressions of SARS-COV2 Yonghan Yu 1, y, Zhengtu Li2,, Yinhu Li , Le Yu3,,

false negatives, one approach is to amplify the viral sequences with multiplex reverse transcription 21

polymerase chain reaction (RT-PCR), followed by sequencing the amplicons with third generation 22

sequencing technologies such as the Nanopore devices. The approach has been applied to sequence the 23

dengue [21] and Zika viruses [17], as well as in the SARS-COV-2 viruses [14]. 24

Nanopore sequencing generate long reads, which enable the detection of large genomic structural 25

changes, viral recombination and viral integration into host genome. They also supports alternative 26

splicing analyses of microorganism studies [3]. However, The long reads suffers high error rates. 27

Detection of the viral and host gene expressions using amplicons is further complicated by several factors. 28

First, the sub-sequences can be amplified unevenly. Second, reads in the gene region amplified from the 29

viral genome are indistinguishable from the gene transcripts. Third, chimeric reads are frequent. 30

Here, we developed a computational tool, CovProfile, to detect viral genomic variations, and profile 31

viral gene expressions through data from multiplex RT-PCR Amplicon Sequencing on Nanopore 32

GridION. To detect the genomic variations, we mapped the reads to SARS-CoV-2 reference genome 33

(MN908947.3), and to profile the viral gene expressions, we adopt a regularization method to find out 34

the primer efficiencies, We have tested the tool on the data obtained from the sputum samples in eleven 35

severe COVID-19 patients. We found seven out of eleven patients are infected by two strains, and 36

discovered no structural variations. We found that the expression of viral genes ORF1ab gene, S gene, M 37

gene, and N gene are high among most of the samples. While performing our tests, we noticed the 38

abundances of MUC5B transcript segments to be consistently higher than those of other genes in host. 39

The function of this gene may be associated with the severe symptom of pulmonary fibrosis. 40

Method and Materials 41

Data 42

In this study, we collected Nanopore sequencing data for SARS-CoV-2 virus from the sputum samples of 43

11 COVID-19 patients. A total of 1,804,043 clean reads were generated, with an average of 44

164,004.00±90,848.28 (Mean±SD) reads per sample (Supplementary Figure 1). Aligning the reads to 45

SARS-CoV-2 and human genomes respectively, the alignment ratio ranged from 20.14% to 99.74% on 46

SARS-CoV-2 genome, and ranged from 0.11% to 72.12% on the human genome. Furthermore, each 47

sample contained 71.24±35.36M data on average, while 61.29±42.13M and 8.87±9.34M data were 48

aligned to SARS-CoV-2 and host genomes, respectively. 49

0

100000

200000

sample01

sample02

sample03

sample04

sample05

sample06

sample07

sample08

sample09

sample10

sample11

Reads

numb

er

virus aligned reads

host aligned reads

other reads

Figure 1. Compositions of the data sets. The heights of the columns represented the composition ofdsta. The data came from the host, SARS-CoV-2 and other genomes respectively, which were representedby blue, orange and grey colors.

2/13

.CC-BY 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted April 6, 2020. . https://doi.org/10.1101/2020.04.05.026146doi: bioRxiv preprint

Page 3: CovProfile: profiling the viral genome and gene ... · CovProfile: profiling the viral genome and gene expressions of SARS-COV2 Yonghan Yu 1, y, Zhengtu Li2,, Yinhu Li , Le Yu3,,

Remove false-positive chimeric reads 50

Chimeric nanopore reads formed by random connection of multiplex RT-PCR amplicons were split into 51

multiple segments if the location of their bilateral ends matches to corresponding locations of PCR 52

primer pair. These chimeric reads could be split into several segments with their bilateral ends exactly 53

matching the sites where the multiplex PCR primers are located, suggesting that they should be 54

processed accurately to avoid false positive identification of virus recombinations or host integrations. 55

We position primers on both viral and host genome, and splits the identified chimeric reads into segments 56

corresponding to PCR amplicons. This method rescues lots of sequencing data, providing more accurate 57

alignment and higher coverage. Realignment were performed on the separated segments before analysis. 58

Mutation Detection 59

We aligned the clean data to SARS-CoV-2 genome (NCBI MN908947.3) with minimap2 [11] with default 60

parameters for Oxford Nanopore reads. With the alignment results, the genomic variations with average 61

quality larger than 10 and coverage depth larger than 20 were called with bcftools (version 1.8). To 62

reduce the effects brought about by the PCR amplification, a variation is filtered if it was located within 63

10 bp upstream or downstream of the primer region, and occurred only in one sample. 64

Estimating viral load and primer efficiency 65

Assume that there are n primer pairs (si, ti), with primer efficiency ei, 1 ≤ i ≤ n. Let the viral gene 66

boundaries be denoted (lg, rg), where g indexes the genes. Denote the observed sequence depth captured 67

by the primer pair (si, ti) as di. 68

We want to estimate the viral gene expression. Let the number of viral copies be denoted v, the 69

number of transcripts of gene g be denoted tg, and the sequencing rate of amplified segment be denoted 70

α. Then, we can have the following two approximations for di. 71

If the primer i is fully covered by gene g (Fig1. A), then as the number of amplicons increases 72

exponentially with the number of replication cycles we have 73

di ≈1

α(v + tg)e

kg . (1)

where k is the number of replication cycles. 74

Similarly, if primer pair i spans across gene boundaries; that is, it belongs to none of the gene 75

transcripts (Fig1. B), then we have 76

di ≈1

αvekg . (2)

If some reads share two ends, where one end is a primer and the other end is at some gene boundary 77

(Fig 1D, Fig 1E), it may indicate that these amplicons are the result of linear amplification. Denote the 78

number of reads ri,g with two endpoints as (ti, eg), and the number of reads rg,i with two endpoints as 79

(li, sg), then 80

rg ≈ 1

2αktgeg (3)

If a gene g is within some primer target region (Fig2. G), then the sequenced fragment can be 81

expected to exactly cover the gene. Let the number of observed reads span exactly the whole gene g as 82

rg, then 83

rg ≈ 1

αtg. (4)

To find the expressions of the viral genes, we want to minimize the following objective function based 84

on the entities defined above 85∑(log (v + ti) + k log ei − logα− log di)

2 +

3/13

.CC-BY 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted April 6, 2020. . https://doi.org/10.1101/2020.04.05.026146doi: bioRxiv preprint

Page 4: CovProfile: profiling the viral genome and gene ... · CovProfile: profiling the viral genome and gene expressions of SARS-COV2 Yonghan Yu 1, y, Zhengtu Li2,, Yinhu Li , Le Yu3,,

∑(log v + k log ei − logα− log di)

2 +∑(log k + log ti + log

ei2− log ri logα)

2 +∑λ1 log

2 ei + λ2 log2 v,

where λ1 and λ2 are two regularization parameters, and k is a weight factor (we set k to 21 according to 86

our experiments). The viral load v, primer efficiencies ei, and expressions tg which minimize the 87

objective equation are our estimation for these parameters. 88

Viral Genome

Transcript

l r

Primer Pair

s t

Read

5’ 3’

A

Viral Genome

Transcript

l r

Primer Pair

s t

Read

5’ 3’

Viral Genome

Transcript

l r

Primer Pair

s t

Read

5’ 3’

Viral Genome

Transcript

l r

Primer Pair

s t

Read

5’ 3’Viral Genome

Transcript

l r

Primer Pair

s t

Read

5’ 3’

B

C D

E

Viral Genome

Transcript

l r

Primer Pair

s t

Read

5’ 3’l' r '

Viral Genome

Transcript

l r

Primer Pair

Read

5’ 3’

F

G

Figure 2. Relations between viral gene, primer pair and sequence. A: Primer pair spans fullyon gene with sequence aligned with primers. B: Primer pair spans across gene boundaries at 5’ withsequence aligned with primers. C: Primer pair spans across gene boundaries at 3’ with sequence alignedwith primers. D: Primer pair spans across two gene with sequence aligned with primers. E: Primer pairspans across gene boundaries at 5’ with sequence aligned with gene boundaries at 3’ and primer at 5’. F:Primer pair spans across gene boundaries at 3’ with sequence aligned with gene boundaries at 5’ andprimer at 3’. G: Sequence aligned with gene boundaries at both 5’ and 3’.

Calculate viral gene expression 89

To calculate the viral gene expression, the duplication generated during PCR were removed based on the 90

estimated primer efficiencies. The gene expression can be calculated as the sum of the base depth of each 91

gene divided by its length. 92

Capture gene product from host 93

To capture the gene products of the host, reads and primers were aligned to the hg38 transcript database 94

derived from UCSC refGene annotation. Primer alignment were performed with BLAT [10]. Regions 95

where the depth of reads alignment are equal or larger to 5 are screened. Reads with endpoints matching 96

the screened primer alignment were treated as PCR duplications. Due to mismatches in primer to 97

4/13

.CC-BY 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted April 6, 2020. . https://doi.org/10.1101/2020.04.05.026146doi: bioRxiv preprint

Page 5: CovProfile: profiling the viral genome and gene ... · CovProfile: profiling the viral genome and gene expressions of SARS-COV2 Yonghan Yu 1, y, Zhengtu Li2,, Yinhu Li , Le Yu3,,

transcript alignment, primer efficiencies may change when capturing gene product. Thus, the recognized 98

PCR duplication were removed and one copy of those reads were kept. The gene expression from host 99

can then be estimated as the sum of the base depth of each transcript divided by its length. 100

Results 101

Distribution of SNPs in SARS-CoV-2 genome 102

After aligning the sequencing data to the SARS-CoV-2 genome, we detected the mutations of 103

SARS-CoV-2 in the 11 samples3. Based on the mutations of SARS-CoV-2 in 11 samples, we discovered 104

38 SNPs (Single Nucleotide Polymorphisms) in total; 35 of them are located on the genetic regions, 105

including gene ORF1ab, S, N and M. Gene ORF1ab contained 23 SNPs in 11 samples, and 17 of them 106

were non-synonymous mutations. Non-synonymous mutations also occupied a large part of SNPs in the 107

gene S, N and M: 5 of 8 SNPs in gene S, 1 of 3 SNPs in gene N, and 1 of 1 SNPs in gene M were 108

non-synonymous mutations. Since most of the SNPs were located in only 1 sample, we deduced that the 109

functions of the discovered SNPs need to be further verified according to lab experiments. Surprisingly, 110

we also found that 7 among 11 samples were heterogeneous at loci 8782. This may imply that the hosts 111

have two different virus strains. However, we did not find creditable InDels (Insertions and Deletions), or 112

structural variations, viral-host integration during the mutation analysis. 113

Evaluations on the amplification efficient for the primers 114

Since 98 pairs of primers from two pools were designed for the enrichment of SARS-CoV-2 genome 115

before sequencing, we discovered that different parts of SARS-CoV-2 genome exhibited different coverage 116

depths. We then calculated the amplification efficiencies for the primers with our proposed method. We 117

discovered that the primers located at the 5’ and 3’ ends of the SARS-CoV-2 genome exhibited higher 118

amplification efficiency (>1.2, Figure 4), including primer01 and primer84-primer93. With the genomic 119

locations of the primers, we know that the genomic sequences within the ranges 24-410 and 120

25,280-28,464 can be amplified with more sequencing data. On the other hand, the remaining primers 121

exhibited slightly lower efficiencies, within the range of 1.0-1.2. For primers 64, 70, 83 and 85, their 122

amplification efficiencies were as low as 1, which indicates that the target regions of these primers were 123

not amplified. Based on the primer efficiencies, the effect of data bias which was induced by the PCR 124

experiments can be attenuated from the original data, and they can be further applied for the 125

normalization of viral gene expression under different primers. 126

Normalized coverage depth on SARS-CoV-2 genome 127

Due to the amplification discrepancy among the primers, the sequencing data was unevenly distributed 128

along the SARS-CoV-2 genome. After aligning the sequencing data to the SARS-CoV-2 genome, we 129

found that the coverage depths of some genomic regions reached over 8000x in some samples, but some 130

regions have coverage depths less than 10x (Figure 5). In order to minimize the data imbalance induced 131

by PCR amplification, we performed data normalization based on the primer efficiency and their 132

locations on SARS-CoV-2 genome, so that the normalized data could reflect the actual virus 133

reproduction in the host. Based on the algorithm illustrated in the method, we obtained the normalized 134

sequencing depth of 11 sample data. 135

After data normalization, the overall coverage depth along the SARS-CoV-2 genome was relatively 136

even, and less than 300x. In addition, we discovered that the ORF1ab gene, S gene, M gene, and N gene 137

exhibited higher coverage depth in the samples, which might indicate higher expression (Figure 5). 138

These genes are related to host cell recognition, amplification and virus particle assembly, so we 139

speculate that the results might explain the severe SARS-CoV-2 virus infection in the patients. Since 140

gene E, ORF6, ORF7a, ORF7b, ORF8 and ORF10 are less than 400bp in length, their transcripts might 141

not have been captured by the primers, and hence their expressions cannot be estimated. 142

5/13

.CC-BY 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted April 6, 2020. . https://doi.org/10.1101/2020.04.05.026146doi: bioRxiv preprint

Page 6: CovProfile: profiling the viral genome and gene ... · CovProfile: profiling the viral genome and gene expressions of SARS-COV2 Yonghan Yu 1, y, Zhengtu Li2,, Yinhu Li , Le Yu3,,

1 2990310000 20000

E

M

NORF10

ORF3a

ORF6

ORF7a

ORF8

SORF1ab

1 2990310000 20000SARS CoV 2

sample01

sample02

sample03

sample04

sample05

sample06

sample07

sample08

sample09

sample10

sample11

1906

0:G>A

227:G

>A

2340

3:A>G

241:C

>T

2670

9:G>A

3037

:C>T

1153

5:G>T

1532

4:C>T

2188

9:T>C

2271

3:C>T

8782

:C>T

2267

8:A>G

2317

3:A>G

2909

5:C>T

8782

:C>T

2276

3:G>C

2649

7:T>C

4788

:C>T

8782

:C>T

1532

4:C>T

1755

0:C>T

1756

7:G>A

2670

9:G>A

2891

6:G>A

1140

5:G>C

2670

9:G>A

7472

:G>A

865:C

>T

8782

:C>T

1083

3:C>T

227:G

>A

8782

:C>T

1510

4:G>T

2909

5:C>T

8782

:C>T

1183

7:G>T

1756

7:G>A

2011

2:A>T

2670

9:G>A

2909

5:C>T

7472

:G>A

8782

:C>T

1020

3:T>C

1182

7:A>G

2248

2:C>T

227:G

>A

2311

2:T>C

2670

9:G>A

2891

6:G>A

2952

7:G>A

6819

:G>T

6996

:T>C

7595

:T>C

8151

:T>C

9131

:C>T

227:G

>A

2271

3:C>T

2670

9:G>A

5512

:C>T

Figure 3. Distributions of High-frequency SNPs among the 11 COVID-19 samples. Thelocations of genes in SARS-CoV-2 were exhibited on top of the plot, and the distributions of SNPs werespecified for the 11 samples.

6/13

.CC-BY 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted April 6, 2020. . https://doi.org/10.1101/2020.04.05.026146doi: bioRxiv preprint

Page 7: CovProfile: profiling the viral genome and gene ... · CovProfile: profiling the viral genome and gene expressions of SARS-COV2 Yonghan Yu 1, y, Zhengtu Li2,, Yinhu Li , Le Yu3,,

1.0 1.1 1.2 1.3 1.4 1.5

Primer01Primer02Primer03Primer04Primer05Primer06Primer07Primer08Primer09Primer10Primer11Primer12Primer13Primer14Primer15Primer16Primer17Primer18Primer19Primer20Primer21Primer22Primer23Primer24Primer25Primer26Primer27Primer28Primer29Primer30Primer31Primer32Primer33Primer34Primer35Primer36Primer37Primer38Primer39Primer40Primer41Primer42Primer43Primer44Primer45Primer46Primer47Primer48Primer49Primer50Primer51Primer52Primer53Primer54Primer55Primer56Primer57Primer58Primer59Primer60Primer61Primer62Primer63Primer64Primer65Primer66Primer67Primer68Primer69Primer70Primer71Primer72Primer73Primer74Primer75Primer76Primer77Primer78Primer79Primer80Primer81Primer82Primer83Primer84Primer85Primer86Primer87Primer88Primer89Primer90Primer91Primer92Primer93Primer94Primer95Primer96Primer97Primer98

Primer efficient

Primer poolsPool1

Pool2

Figure 4. Amplification efficient for the 98 pairs of primers. The amplification efficient wascalculated for the primers, and their distributions were detected in 11 samples. The primers in pool1 andpool2 were marked with red and blue colors respectively.

7/13

.CC-BY 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted April 6, 2020. . https://doi.org/10.1101/2020.04.05.026146doi: bioRxiv preprint

Page 8: CovProfile: profiling the viral genome and gene ... · CovProfile: profiling the viral genome and gene expressions of SARS-COV2 Yonghan Yu 1, y, Zhengtu Li2,, Yinhu Li , Le Yu3,,

Original sequencing depth Normalized sequencing depth

0 10000 20000 30000 0 10000 20000 30000

0

10

20

30

40

0

2000

4000

6000

8000

sample

01

Original sequencing depth Normalized sequencing depth

0 10000 20000 30000 0 10000 20000 30000

0

5

10

15

20

25

0

500

1000

sample

02

Original sequencing depth Normalized sequencing depth

0 10000 20000 30000 0 10000 20000 30000

0

5

10

15

0

300

600

900

1200

sample

03

Original sequencing depth Normalized sequencing depth

0 10000 20000 30000 0 10000 20000 30000

0

5

10

15

20

0

2000

4000

6000

sample

04

Original sequencing depth Normalized sequencing depth

0 10000 20000 30000 0 10000 20000 30000

0

10

20

30

40

50

0

2000

4000

6000

8000

sample

05

Original sequencing depth Normalized sequencing depth

0 10000 20000 30000 0 10000 20000 30000

0

30

60

90

0

2000

4000

6000

8000

sample

06

Original sequencing depth Normalized sequencing depth

0 10000 20000 30000 0 10000 20000 30000

0

100

200

0

2000

4000

6000

8000

sample

07

Original sequencing depth Normalized sequencing depth

0 10000 20000 30000 0 10000 20000 30000

0

10

20

30

0

500

1000

1500

sample

08

Original sequencing depth Normalized sequencing depth

0 10000 20000 30000 0 10000 20000 30000

0

10

20

30

40

50

0

2000

4000

6000

8000

sample

09

Original sequencing depth Normalized sequencing depth

0 10000 20000 30000 0 10000 20000 30000

0

10

20

30

0

2000

4000

6000

sample

10

Original sequencing depth Normalized sequencing depth

0 10000 20000 30000 0 10000 20000 30000

0

20

40

0

2000

4000

6000

8000

sample

11

E

M

NORF10

ORF3a

ORF6

ORF7a

ORF8

SORF1ab

1 2990310000 20000

E

M

NORF10

ORF3a

ORF6

ORF7a

ORF8

SORF1ab

1 2990310000 20000

Figure 5. Original and normalized coverage depth for the 11 samples. The locations of genesin SARS-CoV-2 were exhibited on top of the plot, and they were marked by different colors. Each samplecontained two plots, which represented the original (left) and normalized coverage (right) depths for thebases along the SARS-CoV-2 genome respectively. The coverage depth were colored by the genes inSARS-CoV-2.

8/13

.CC-BY 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted April 6, 2020. . https://doi.org/10.1101/2020.04.05.026146doi: bioRxiv preprint

Page 9: CovProfile: profiling the viral genome and gene ... · CovProfile: profiling the viral genome and gene expressions of SARS-COV2 Yonghan Yu 1, y, Zhengtu Li2,, Yinhu Li , Le Yu3,,

High expression genes in host 143

Previous studies on the sequencing data have shown that it contains the human genome, so we detected 144

the corresponding genes and their expression levels. Aligning these data to human transcript database, 145

we identified many segments of human gene transcripts. Subsequently, the coverage depths of these genes 146

were normalized based on the primer efficiencies. Only several genes are listed within the top 5% in more 147

than five samples according to the normalized depth. They include the genes MUC5B, TCAF12, KRT12 148

and etc. MUC5B is the only gene consistently within the top 5% across all the samples. More than 50% 149

of the MUC5B gene region is covered (Figure 5). However, the region with high coverage in MUC5B 150

gene is inconsistent among the samples, perhaps due to data scarcity. 151

Conclusion and Discussion 152

In this study, we implemented a software tool, CovProfile, to analyze multiplex RT-PCRs on Nanopore 153

sequencing data, and studied the sputum samples of COVID-19 patients. Our results show enrichment of 154

PCR amplicons which totally cover 99.7% of SARS-CoV-2 virus genome, revealing that the multiplex 155

RT-PCRs are a reliable method in identification of COVID-19 infection. Interestingly, the variant 156

(26709:G>A) is identified in half of samples, and shows consistent heterozygosity (alteration frequency: 157

19.36%-26.15%) in all relevant samples. This suggests at least two virus strains in the patients. 158

The data contained large amount of chimeric reads formed by unintended random connection of 159

multiplex PCR amplicons, making up 1.69% of total sequencing reads. Similar observation has been 160

reported that these chimeric reads can be formed during library preparation and sequencing, and lead to 161

the cross barcode assignment errors in multiplex samples [24,26]. We applied CovProfile on the data and 162

recovered majority of chimeric reads. 163

The sequencing reads could be used to evaluate the expression of SARS-CoV-2 genes. However, the 164

coverage unevenness caused by efficiency differences in the multiplex primers impedes our calculation of 165

pathogenic genes expression. To solve this problem, we propose a multi-parameter method which 166

considers the primer efficiency, gene differential expressions, and RNA virus genome copy number 167

simultaneously. Different locations on the viral genome have distinct composition of molecule sources, 168

including RNA genome, expressed viral transcript, and PCR amplicons. According to the gene locations 169

along the viral genome, we enumerate reads types based on aligned positions, and deduce the 170

contributions of different sources to the local region depth. By applying this method on our data, we 171

find that viral gene ORF1ab may maintain higher expression than other genes across most samples. This 172

viral gene occupied a large region on SARS-CoV-2 genome, and it responsible for the amplification and 173

assembly of the virus [15]. For the primer, the multiplex primers efficiency has wide distribution in single 174

sample, and the distributions among samples demonstrate similar trend. This suggests that our 175

estimation of primers efficiency comply to the reality. 176

We utilize the sequencing data to evaluate host genome expression and find several genes consistently 177

activated in the majority of samples after eliminating the PCR effects. The most abundant gene is the 178

MUC5B, which encodes a member of the mucin protein family. The family proteins are the major 179

components of mucus secretions in normal lung mucus [18]. MUC5B has been reported to be highly 180

expressed in lung disease, such as chronic obstructive pulmonary disease (COPD). Its up-regulated 181

transcription could cause mucociliary dysfunction and enhances lung fibrosis [8, 13], thus might be 182

relevant to the recent report that a lot of mucus are found in the lung of COVID-19 infected patients. 183

Another activated host gene is the TCAF2, which encodes transient receptor potential cation 184

channel-associated factor. It could bind to the TRPM8 channel and promote its trafficking to the cell 185

surface. Previous study reported that TCAF2 significantly increases the migration speed of prostate 186

cancer cells when it co-expresses with TRPM8 [5]. Other enriched genes might also break the cellular 187

homeostasis, including the nucleoredoxin-like gene NXNL2, tyrosine kinases FER, and so on. Further 188

studies of these genes could help researchers gain insights into the dysregulation of host cells after the 189

SARS-CoV-2 infection. We should point out these genes are extracted based on a very limited amount of 190

data, and the conclusions can be different with a large data sets. 191

9/13

.CC-BY 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted April 6, 2020. . https://doi.org/10.1101/2020.04.05.026146doi: bioRxiv preprint

Page 10: CovProfile: profiling the viral genome and gene ... · CovProfile: profiling the viral genome and gene expressions of SARS-COV2 Yonghan Yu 1, y, Zhengtu Li2,, Yinhu Li , Le Yu3,,

0

5

10

15

20

25

0 5000 10000 15000

sample

01

0

100

200

300

0 5000 10000 15000

sample

02

0

10

20

30

0 5000 10000 15000

sample

03

0.0

2.5

5.0

7.5

10.0

12.5

0 5000 10000 15000

sample

04

0.0

2.5

5.0

7.5

10.0

12.5

0 5000 10000 15000

sample

05

0

5

10

15

0 5000 10000 15000

sample

06

0

5

10

15

0 5000 10000 15000

sample

07

0

100

200

0 5000 10000 15000

sample

08

0

1

2

3

4

5

0 5000 10000 15000

sample

09

0

10

20

30

40

50

0 5000 10000 15000

sample

10

0.0

2.5

5.0

7.5

10.0

0 5000 10000 15000Location

sample

11

Figure 6. Normalized coverage depth on MUC5B gene in the 11 samples. In each plot, theX and Y coordinates exhibited the length of MUC5B and the normalized coverage depth on along theMUC5B gene respectively.

10/13

.CC-BY 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted April 6, 2020. . https://doi.org/10.1101/2020.04.05.026146doi: bioRxiv preprint

Page 11: CovProfile: profiling the viral genome and gene ... · CovProfile: profiling the viral genome and gene expressions of SARS-COV2 Yonghan Yu 1, y, Zhengtu Li2,, Yinhu Li , Le Yu3,,

Overall, we present a tool for genomic variation detection and the transcriptional profile of 192

SARS-CoV-2 infection from multiplex Nanopore sequencing data. The tool decomposes and realign the 193

chimeric reads that commonly exist in the multiplex Nanopore sequencing, and this will greatly promote 194

the data usage. The method also provides a robust estimation of primer efficiency through a 195

multi-parameter model. The viral and host gene expression we report could pave steps to the 196

development of diagnosis and therapy strategies of SARS-CoV-2 infection candidates. 197

Acknowledgments 198

We thank just about everybody. 199

Funding 200

Not applicable. 201

Author’s contributions 202

SCL conceived the idea. 203

SCL and FY supervised the project. 204

SCL and YY designed the algorithms 205

SCL, WJ, YY and HL discussed and designed the experiments. 206

YY and JL implemented the code and conducted the analysis. 207

ZL and LY prepared the data. 208

SCL YY WJ and YL drafted the manuscript. 209

SCL, FY revised the manuscript. 210

All authors read and approved the final manuscript. 211

References

1. Novel coronavirus (2019-ncov) situation reports,https://www.who.int/emergencies/diseases/novelcoronavirus-2019/situation-reports, 2020.

2. A. Assiri, G. R. Abedi, M. A. Masri, A. B. Saeed, S. I. Gerber, and J. T. Watson. Middle eastrespiratory syndrome coronavirus infection during pregnancy: A report of 5 cases from saudiarabia: Table 1. Clinical Infectious Diseases, 63(7):951–953, June 2016.

3. M. T. Bolisetty, G. Rajadinakaran, and B. R. Graveley. Determining exon connectivity in complexmrnas by nanopore sequencing. Genome Biology, 16(1):204, 2015.

4. E. de Wit, N. van Doremalen, D. Falzarano, and V. J. Munster. Sars and mers: recent insightsinto emerging coronaviruses. Nature Reviews Microbiology, 14(8):523, 2016.

5. D. Gkika, L. Lemonnier, G. Shapovalov, D. Gordienko, C. Poux, M. Bernardini, A. Bokhobza,G. Bidaux, C. Degerny, K. Verreman, et al. Trp channel–associated factors are a novel proteinfamily that regulates trpm8 trafficking and activityidentification of trpm8 partner proteins. TheJournal of Cell Biology, 208(1):89–107, 2015.

6. A. Gorbalenya, S. Baker, R. Baric, et al. The species severe acute respiratory syndrome-relatedcoronavirus: classifying 2019-ncov and naming it sars-cov-2. Nature Microbiology, 2020.

7. P. H. Guzzi, D. Mercatelli, C. Ceraolo, and F. M. Giorgi. Master regulator analysis of thesars-cov-2/human interactome. bioRxiv, 2020.

11/13

.CC-BY 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted April 6, 2020. . https://doi.org/10.1101/2020.04.05.026146doi: bioRxiv preprint

Page 12: CovProfile: profiling the viral genome and gene ... · CovProfile: profiling the viral genome and gene expressions of SARS-COV2 Yonghan Yu 1, y, Zhengtu Li2,, Yinhu Li , Le Yu3,,

8. L. A. Hancock, C. E. Hennessy, G. M. Solomon, E. Dobrinskikh, A. Estrella, N. Hara, D. B. Hill,W. J. Kissner, M. R. Markovetz, D. E. G. Villalon, et al. Muc5b overexpression causes mucociliarydysfunction and enhances lung fibrosis in mice. Nature communications, 9(1):1–10, 2018.

9. M. Hoffmann, H. Kleine-Weber, S. Schroeder, N. Kruger, T. Herrler, S. Erichsen, T. S. Schiergens,G. Herrler, N.-H. Wu, A. Nitsche, et al. Sars-cov-2 cell entry depends on ace2 and tmprss2 and isblocked by a clinically proven protease inhibitor. Cell, 2020.

10. W. J. Kent. BLAT—the BLAST-like alignment tool. Genome Research, 12(4):656–664, Mar. 2002.

11. H. Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094–3100,05 2018.

12. C. Liu, Q. Zhou, Y. Li, L. V. Garner, S. P. Watkins, L. J. Carter, J. Smoot, A. C. Gregg, A. D.Daniels, S. Jervey, et al. Research and development on therapeutic agents and vaccines forcovid-19 and related human coronavirus diseases, 2020.

13. S. K. Mathai, S. Humphries, J. A. Kropski, T. S. Blackwell, J. Powers, A. D. Walts, C. Markin,J. Woodward, J. H. Chung, K. K. Brown, et al. Muc5b variant is associated with visually andquantitatively detected preclinical pulmonary fibrosis. Thorax, 74(12):1131–1139, 2019.

14. S. C. Moore, R. Penrice-Randal, M. Alruwaili, X. Dong, S. T. Pullan, D. Carter, K. Bewley,Q. Zhao, Y. Sun, C. Hartley, et al. Amplicon based minion sequencing of sars-cov-2 andmetagenomic characterisation of nasopharyngeal swabs from patients with covid-19. medRxiv,2020.

15. T. Phan. Genetic diversity and evolution of SARS-CoV-2. Infection, Genetics and Evolution,81:104260, July 2020.

16. M. Prachar, S. Justesen, D. B. Steen-Jensen, S. P. Thorgrimsen, E. Jurgons, O. Winther, andF. O. Bagger. Covid-19 vaccine candidates: Prediction and validation of 174 sars-cov-2 epitopes.bioRxiv, 2020.

17. J. Quick, N. D. Grubaugh, S. T. Pullan, I. M. Claro, A. D. Smith, K. Gangavarapu, G. Oliveira,R. Robles-Sikisaka, T. F. Rogers, N. A. Beutler, et al. Multiplex pcr method for minion andillumina sequencing of zika and other virus genomes directly from clinical samples. natureprotocols, 12(6):1261, 2017.

18. M. G. Roy, A. Livraghi-Butrico, A. A. Fletcher, M. M. McElwee, S. E. Evans, R. M. Boerner,S. N. Alexander, L. K. Bellinghausen, A. S. Song, Y. M. Petrova, et al. Muc5b is required forairway defence. Nature, 505(7483):412–416, 2014.

19. T. P. Sheahan, A. C. Sims, S. R. Leist, A. Schafer, J. Won, A. J. Brown, S. A. Montgomery,A. Hogg, D. Babusis, M. O. Clarke, et al. Comparative therapeutic efficacy of remdesivir andcombination lopinavir, ritonavir, and interferon beta against mers-cov. Nature Communications,11(1):1–14, 2020.

20. H. Shi, X. Han, N. Jiang, Y. Cao, O. Alwalid, J. Gu, Y. Fan, and C. Zheng. Radiological findingsfrom 81 patients with COVID-19 pneumonia in wuhan, china: a descriptive study. The LancetInfectious Diseases, 20(4):425–434, Apr. 2020.

21. S. C. Stubbs, B. A. Blacklaws, B. Yohan, F. A. Yudhaputri, R. F. Hayati, B. Schwem, E. M.Salvana, R. V. Destura, J. S. Lester, K. S. Myint, et al. Assessment of a multiplex pcr andnanopore-based method for dengue virus sequencing in indonesia. Virology Journal, 17(1):1–13,2020.

22. G. M.-K. Tse. Pulmonary pathological features in coronavirus associated severe acute respiratorysyndrome (SARS). Journal of Clinical Pathology, 57(3):260–265, Mar. 2004.

12/13

.CC-BY 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted April 6, 2020. . https://doi.org/10.1101/2020.04.05.026146doi: bioRxiv preprint

Page 13: CovProfile: profiling the viral genome and gene ... · CovProfile: profiling the viral genome and gene expressions of SARS-COV2 Yonghan Yu 1, y, Zhengtu Li2,, Yinhu Li , Le Yu3,,

23. C. Wang, P. W. Horby, F. G. Hayden, and G. F. Gao. A novel coronavirus outbreak of globalhealth concern. The Lancet, 395(10223):470–473, 2020.

24. R. White, C. Pellefigues, F. Ronchese, O. Lamiable, and D. Eccles. Investigation of chimeric readsusing the minion. F1000Research, 6, 2017.

25. F. Wu, S. Zhao, B. Yu, Y.-M. Chen, W. Wang, Z.-G. Song, Y. Hu, Z.-W. Tao, J.-H. Tian, Y.-Y.Pei, et al. A new coronavirus associated with human respiratory disease in china. Nature,579(7798):265–269, 2020.

26. Y. Xu, K. Lewandowski, S. Lumley, S. Pullan, R. Vipond, M. Carroll, D. Foster, P. C. Matthews,T. Peto, and D. Crook. Detection of viral pathogens with multiplex nanopore minion sequencing:be careful with cross-talk. Frontiers in microbiology, 9:2225, 2018.

27. P. Zhou, X.-L. Yang, X.-G. Wang, B. Hu, L. Zhang, W. Zhang, H.-R. Si, Y. Zhu, B. Li, C.-L.Huang, et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin.Nature, pages 1–4, 2020.

28. N. Zhu, D. Zhang, W. Wang, X. Li, B. Yang, J. Song, X. Zhao, B. Huang, W. Shi, R. Lu, et al.China novel coronavirus investigating and research team. a novel coronavirus from patients withpneumonia in china, 2019. N Engl J Med, 382(8):727–733, 2020.

13/13

.CC-BY 4.0 International licensewas not certified by peer review) is the author/funder. It is made available under aThe copyright holder for this preprint (whichthis version posted April 6, 2020. . https://doi.org/10.1101/2020.04.05.026146doi: bioRxiv preprint