a global omics data sharing and analytics marketplace: case … · 2020. 9. 28. · intelligence....

29
A global omics data sharing and analytics marketplace: Case study of a rapid data COVID-19 pandemic response platform. Ibrahim Farah 1,2,5 , Giada Lalli 1,3 , Darrol Baker 1,4 & Axel Schumacher 1,* 1: Shivom Ventures Limited, 71-75 Shelton Street, Covent Garden, London, England, WC2H 9JQ; 2: Division of Psychiatry, University College London; 3: Katholieke Universiteit Leuven, Belgium; 4: GoldenHelix Foundation, London. England, WC2N 5AP, 5: Novo Nordisk Research Centre Oxford, Innovation Building, Old Road Campus, Roosevelt Drive, Oxford, UK. *Corresponding author: [email protected] Keywords: Data sharing | Cohort studies | COVID-19 | Omics data | Genomics | Pandemic preparedness | SARS-CoV2 | Bioinformatics | Data marketplace | Precision medicine. Abstract Under public health emergencies, particularly an early epidemic, it is fundamental that genetic and other healthcare data is shared across borders in both a timely and accurate manner before the outbreak of a global pandemic. However, although the COVID-19 pandemic has created a tidal wave of data, most patient data is siloed, not easily accessible, and due to low sample size, largely not actionable. Based on the precision medicine platform Shivom, a novel and secure data sharing and data analytics marketplace, we developed a versatile pandemic preparedness platform that allows healthcare professionals to rapidly share and analyze genetic data. The platform solves several problems of the global medical and research community, such as siloed data, cross-border data sharing, lack of state-of-the-art analytic tools, GDPR- compliance, and ease-of-use. The platform serves as a central marketplace of ‘discoverability’. The platform combines patient genomic & omics data sets, a marketplace for AI & bioinformatics algorithms, new diagnostic tools, and data-sharing capabilities to advance virus epidemiology and biomarker discovery. The bioinformatics marketplace contains some preinstalled COVID-19 pipelines to analyze virus- and host genomes without the need for bioinformatics expertise. The platform will be the quickest way to rapidly gain insight into the association between virus-host interactions and COVID-19 in various populations which can have a significant impact on managing the current pandemic and potential future disease outbreaks. Introduction In December 2019, a cluster of severe pneumonia cases epidemiologically linked to an open-air live animal market in the city of Wuhan, China 1,2 . The sudden spread of this deadly disease, later called COVID-19, led local health officials to issue an epidemiological alert to the World Health Organization’s (WHO) and the Chinese Center for Disease Control and Prevention. Quickly, the etiological agent of the unknown disease was found to be a coronavirus, SARS-CoV-2, the same subgenus as the SARS virus that caused a deadly global outbreak in 2003. By reconstructing the evolutionary history of SARS-CoV-2, an international research team has discovered that the lineage that gave rise to the virus has been circulating in horseshoe bats for decades and likely includes other viruses with the ability to infect humans 3 . The team concluded that the global health . CC-BY-NC 4.0 International license It is made available under a perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint The copyright holder for this this version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257 doi: medRxiv preprint NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice.

Upload: others

Post on 02-Jan-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

Aglobalomicsdatasharingandanalyticsmarketplace:CasestudyofarapiddataCOVID-19

pandemicresponseplatform.

IbrahimFarah1,2,5,GiadaLalli1,3,DarrolBaker1,4&AxelSchumacher1,*

1:ShivomVenturesLimited,71-75SheltonStreet,CoventGarden,London,England,WC2H9JQ;2:DivisionofPsychiatry,UniversityCollegeLondon;3:KatholiekeUniversiteitLeuven,Belgium;4:GoldenHelixFoundation,London.England,WC2N5AP,5:NovoNordiskResearchCentreOxford,InnovationBuilding,OldRoadCampus,RooseveltDrive,Oxford,UK.*Correspondingauthor:[email protected]:Datasharing| Cohortstudies| COVID-19| Omicsdata| Genomics| Pandemicpreparedness| SARS-CoV2| Bioinformatics| Datamarketplace| Precisionmedicine.

Abstract Under public health emergencies, particularly an early epidemic, it is fundamental that genetic and other healthcare data is shared across borders in both a timely and accurate manner before the outbreak of a global pandemic. However, although the COVID-19 pandemic has created a tidal wave of data, most patient data is siloed, not easily accessible, and due to low sample size, largely not actionable. Based on the precision medicine platform Shivom, a novel and secure data sharing and data analytics marketplace, we developed a versatile pandemic preparedness platform that allows healthcare professionals to rapidly share and analyze genetic data. The platform solves several problems of the global medical and research community, such as siloed data, cross-border data sharing, lack of state-of-the-art analytic tools, GDPR-compliance, and ease-of-use. The platform serves as a central marketplace of ‘discoverability’. The platform combines patient genomic & omics data sets, a marketplace for AI & bioinformatics algorithms, new diagnostic tools, and data-sharing capabilities to advance virus epidemiology and biomarker discovery. The bioinformatics marketplace contains some preinstalled COVID-19 pipelines to analyze virus- and host genomes without the need for bioinformatics expertise. The platform will be the quickest way to rapidly gain insight into the association between virus-host interactions and COVID-19 in various populations which can have a significant impact on managing the current pandemic and potential future disease outbreaks.

IntroductionInDecember2019, a cluster of severepneumoniacases epidemiologically linked to an open-air liveanimalmarket in thecityofWuhan,China 1,2.Thesudden spread of this deadly disease, later calledCOVID-19, led local health officials to issue anepidemiological alert to the World HealthOrganization’s (WHO) and the Chinese Center forDisease Control and Prevention. Quickly, theetiologicalagentoftheunknowndiseasewasfound

tobeacoronavirus,SARS-CoV-2,thesamesubgenusas the SARS virus that caused a deadly globaloutbreak in 2003. By reconstructing theevolutionary history of SARS-CoV-2, aninternationalresearchteamhasdiscoveredthatthelineage that gave rise to the virus has beencirculatinginhorseshoebatsfordecadesandlikelyincludes other viruses with the ability to infecthumans3.Theteamconcludedthattheglobalhealth

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice.

Page 2: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

2

system was too late in responding to the initialSARS-CoV-2 outbreak, and that preventing futurepandemics will require the implementation ofhumandiseasesurveillancesystemsthatareabletoquickly identify novel pathogens and theirinteraction with the humans host and respond inreal time. The key to successful surveillance isknowing which viruses caused the new outbreakandhowthehostgenomeaffectsseverityofdisease.Inoutbreaksofzoonoticpathogens,identificationofthe infection source is crucial because this mayallow health authorities to separate humanpopulationsfromthepathogensource(e.g.wildlifeordomesticanimalreservoirs)posingthezoonoticrisk3.Evenifstoppinganoutbreakinitsearlystagesis not possible, data collection, data sharing, andanalysisofhost<>virusinteractionsisneverthelessimportant for containment purposes in otherregions of the planet and prevention of futureoutbreaks. Such surveillance measures are easilyimplemented with the pandemic preparednessplatformshowcasedinthisreportthatisbasedonthe novel data sharing and data analyticsmarketplace Shivom4 (https://www.shivom.io/).Theplatformisaprovenresearchecosystemusedby universities, biotech, and bioinformaticsorganizationstoshareandanalyzeomicsdataandcanbeusedforavarietyofusecases;fromprecisionmedicine, drug discovery, translational science tobuilding data repositories, and tackling a diseaseoutbreak.In the current coronavirus outbreak, the globalresearch community is mounting a large-scalepublichealthresponsetounderstandthepathogenandcombatthepandemic.Vastamountsofnewdataarebeinggeneratedandneedtobeanalyzedinthecontextofpre-existingdataandsharedforsolution-centeredresearchapproachestocombatthediseaseand inform public health policies. However,research showed that the vastmajority of clinicalstudies on COVID-19 do not meet proper qualitystandardsduetosmallsamplesize5.Our approach is designed to provide healthcareprofessionalswithanurgentlyneededplatformtofind and analyze genetic data, and securely andanonymously share sensitive patient data to fightthediseaseoutbreak.Nopatientdatacanbetracedbacktothedatadonor(patientordatacustodian)because only algorithms can access the raw data.Researchersqueryingthedata-hubhavenoaccesstothegeneticdatasource, i.e.norawdata(e.g.nosequencing reads can be downloaded). Instead,researchers and data analysts can run predefinedbioinformaticsandAIalgorithmsonthedata.Onlysummary statistics are provided (e.g. GWAS,

PHEWAS,polygenicriskscoreanalyses).Inadditionto genetics bioinformatics pipelines, specializedCOVID-19 bioinformatics tools are available (e.g.,assembly, mapping, andmetagenomics pipelines).Using these OPEN tools, researchers can quicklyunderstandwhysomepeopledevelopsevereillnesswhile others do show only mild symptoms or nosymptomswhatsoever.Wemighthypothesizethatsome people harbor protective gene variants, ortheir gene regulation reacts differently when theCOVID-19virusattacksthehost.BreakingdowndatasilosThe usability of digital healthcare data is limitedwhen it is trapped in silos limiting its maximumpotential value. Examples of COVID-19 datadepositories include the European CommissionCOVID-19 Data Portal6, the Canadian COVIDGenomics Network7 (CanCOGeN), GenomicsEnglandandtheGenOMICC(GeneticsofMortalityinCritical Care) consortium8, the COVID-19 HostGenetics Initiative9, the National COVID CohortCollaborative (N3C)10, the COVID-19GenomicsUK(COG-UK) Consortium11, the Secure CollectiveCOVID-19 Research (SCOR) consortium12, theInternational COVID-19 Data Research Alliance13,and theCovidHumanGeneticEffort consortium14,among several others. All of them, while veryvaluable for fighting the pandemic, presentcomplicated access hurdles, limited data-sharingcapabilities, and missing interoperability andanalyticscapabilitiesthatlimittheirusefulness.Compartmentalized information limits ourhealthcaresystem’sabilitytomakelargestridesinresearchanddevelopmentrestrictingtheavailablebenefitsthatcanbegeneratedforthepublic.Thisisespeciallytrueforomicsdata.Ifomicsdatasilosarebroken up and can be globally aggregated, webelievethatsocietywillrealizeagainofvaluethatobserves accelerating returns. If individuals andmedical centers around the world shared theirgenomicdatamorefully,thenasawholethisdatawouldbesignificantlymoreuseful.ParticularlytheroleofhostgeneticsinimpactingsusceptibilityandseverityofCOVID-19hasbeenstudiedonlyinafewsmallcohorts.Expanding theopportunities to findpatients somewhere in the world that possessunusual mutations and phenotypes relevant tonational efforts, or have SARS-CoV-2 resistantmutations, would shift precision medicine toanotherlevel.Bycombininggenomesequenceddatawithhealthrecords, researchers and clinicianswouldhave aninvaluable resource that can be used to improvepatient outcomes, and additionally be used to

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 3: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

3

investigate the causes and treatments of diseases.For big data approaches to thrive, however,historical barriers dividing research groups andinstitutionsneedtobebrokendownandaneweraofopen,collaborativeanddata-drivenscienceneedstobeusheredin.Thereisalsoacosttousingandproducingdata,andif this cost is shared amongstmany stakeholder’snew data sources, which outperform olderdatabases significantly, can be built. Anotherobstacleisdatagenerationitself.Genomeshavetocome from somewhere and currently researchershave limited access to them. Healthcareorganizations and private individuals often haverestrictedaccesstotheirowndataandevenwhentheydo, theyare limited in theways theycanusethisdata,or forsharingwiththirdparties(e.g. forscientific studies). Computer scientists,bioinformaticians, and machine learningresearchersaretacklingthepandemicthewaytheyknow how: compiling datasets and buildingalgorithmstolearnfromthem.Butthevastmajorityof data collected is not accessible to the public,partly because there is no platform that offersanonymized data sharing on a global level, whileaddressing privacy concerns, and regulatoryhurdles.Whatisneededisaplatformthatcanstoredata in a secure, anonymized, and impenetrablemanner with the goal of preventing maliciousattacks or unauthorized access. While ensuringaccess controls are maintained and satisfied, thedatabase must also be user friendly, publiclyaccessible,andsearchable.Inaddition,theremustbewaystoquicklyandeasilyanalyze the collecteddata. Thesepropertiesmakethedatainthedatabaseusable,meaningitmustbeeasy for use in population health studies,pharmaceutical R&D, and personal genomics. Inaddition, such a platformmust be accessible on aglobal scale, aswell as offer data provenance andauditing features. We believe that for such aplatformtobesuccessful,addingnewgenomicdatato the platform must be trivial and accessible toeveryone. We aim to provide such a platform byusingconvergenttechnologies includinggenomics,cryptography, distributed ledgers, and artificialintelligence.DatasharingduringapandemicGlobalpandemicscomeandgo.Thetragicimpactsoftheunusuallydeadlyinfluenzapandemicin1918,called the Spanish flu, or the bubonic plague thatkilledasmanyasone-thirdofEurope’sandNorthAfrica’s population in the14th century remindsus

thatCOVID-19wasn’tthefirstglobalpandemic,anditisclearthatitwon’tbethelast.IncontrasttotheMiddleAges,wenowhave the technological toolsand exceptional scientific networks to fightpandemics. Already, the COVID-19 pandemic hasdemonstratedthatsharingdatacanimproveclinicaloutcomesforpatients,andthatquickaccesstodataliterally has life-or-death consequences inhealthcare delivery. However, only a minority ofhealthcareorganizations areparticipating indata-sharing efforts. Without proper coordinationamongst initiatives and agencies, and making theefforts public, these initiatives run the risk ofduplicating efforts or missing opportunities,resulting in slower progress and economicinefficienciesandfurtherdatasilos.Comprehensive solutions to problems such as theSARS-CoV-2 virus outbreak require genuineinternational cooperation, not only betweengovernmentsandtheprivatesector,butalso fromalllevelsofourhealthsystems.Researchersfromallover the world need to be able to act quickly,withoutwaiting for international efforts to slowlycoordinate their efforts, to understand early localoutbreaks and to start intervention measureswithoutdelay.Newsolutionstotacklethepandemicneed also early adopters thatwant tomove awayfromourarchaicdatainformationsystemsthatcanhamstringreal-timedetectionofemergingdiseasethreats.TheneedforrapidopendatasharingMany research and innovation actors havereoriented some of their previously fundedactivities towards COVID-19, but often with littleguidance from policy makers. Despite allcomplications and concerns, the advantages ofsharingdataquicklyandsafelycanbesignificant.Alack of coordination at international level onnumerous initiatives launched by differentinstitutionstocombatCOVID-19resultedinnotablyreduced cooperation and exchanges of data andresultsbetweenprojects,limitedinteroperability,aswellaslowerdataqualityandinterpretation.Opendata plays a major role in the global response tolarge-scaleoutbreaks.For instance,analysisof the2016 Ebola outbreak points to the importance ofopen data, including genomic data to generatelearnings about diseases where effective vaccinesare lackingorhavenotbeendeveloped. Similarly,opendataplayedamajorroleintheresponsetothe2016 Zika outbreak with a commitment fromleadingnationalagenciesandscienceorganizationsto share data. With the Shivom platform, it is

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 4: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

4

possibletohelptheglobalresearchcommunityfast-track discover new preventive measures andoptimize logistic solutions. Moreover, this datalendsitselfnaturallyinpredictingandpreparingforfuture,inevitablediseaseoutbreaks.Acommitmenttoopenaccessingenomicsresearchhasfoundwidespreadbackinginscienceandhealthpolicy circles, but data repositories derived fromhuman subjects may have to operate undermanaged access, to protect privacy, align withparticipant consent and ensure that the resourcecanbemanaged ina sustainableway.Dataaccesscommittees need to be flexible, to cope withchangingtechnologyandopportunitiesandthreatsfromthewiderdatasharingenvironment.Trade-offbetweentimelinessandaccessibilityThe spread of SARS-CoV-2 has taught us thatprevention relies on community engagement andhas exposed the fragility in our researchrelationships. Inan idealscenario,healthworkers,suchasdoctors,virologists,datacustodiansatdata-repositories, biobanks and in pharma, as well asstudentsattheuniversityshouldbeabletoquicklycollect, share, and analyze health data in a global,cross-bordermannerwithoutcomplicated,lengthyaccess-procedures. The ability to share datasecurely should not be restricted to selectedspecialistsinthefieldbutshouldbepossibleforallhealthcareworkers,assuchtheaccessandsharingproceduresneedtobeeasy.Datacollectedduringapandemic should not only be accessible, but alsoanalyzable;meaningallstakeholdersandthepublicshouldhavetheabilitytoanalyzethedata,notonlythe data custodians or project leads of researchconsortia.Havingsuchanaffordableandeasy-to-useplatforminplace,evendevelopingcountriesareenabled towork together to reduce a pathogens’ spread andminimize its impact, despite severe resourceconstraints.However,ourhealthsystemsare-moreoftenthannot-conservativeandinnovationaverse.

What is needed is true leadership in the researchcommunitythatisabletoadoptnewsolutions,whoarereadyto tackle theuncertaintyasituation likeCOVID-19 creates. There is no value in competingfor data during a global crisis, instead humanityneedstocollaboratemorethaneverbeforetodefeatacommonenemythathastakentheworldhostage.Collective solutions that provide a ‘one-stop shop’forthecentralizationofinformationonpathogen<>host interactionscanhelpensurethatappropriateconditionsforcollaborativeresearchandsharingofpreliminaryresearchfindingsanddataareinplacetoreaptheirfullbenefits.Theprovidedplatformofferssuchaone-stopshopsolution, without being an exclusive storingsolution. Data shared on the platform can also becollectedandsharedelsewhere.Beyondshort-termpolicyresponsestoCOVID-19,theShivomplatformsupports research and innovation that could helptackle future epidemics early, on a national orinternationalscale.

MethodsTheShivomdataanalyticsplatformwasbuiltonaprinciple of open science, transparency, andcollaboration and has been designed to giveresearchersaone-stopecosystemtoeasily,securelyand rapidly share and analyze their omics data,therefore simplifying data acquisition and dataanalysisprocesses.Nocodingproficiencyisneededforusingtheplatform,soeveryonecanuseitandgetresults, minimizing work times. Our organizationstarted out with the mission to accelerate dataaccess, easy analysis, and exchange acrossinternational and organizational boundaries topromoteequalaccesstohealthcareforall.The platform facilitates sharing and analysis ofgenetic,biomedical,andhealth-relateddatainasafeand trusted environment and contains twoconnectedmarketplaces,oneforomicsdataandonefor bioinformatics/AI pipelines and tools (see Fig.1).

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 5: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

5

Figure 1: Data Flow in the Shivom platform. The user does not need in-depth bioinformatics expertise and is guided through all necessary steps from data upload to data analytics. Basically, data from a variety of sources is uploaded by the data custodian and data permission rules are set. If the data is set to open or monetizable, it will be searchable in the data marketplace. Once a dataset of interest is identified and filtered, the user can select a bioinformatic workflow (pipeline), configure user-specific parameters if needed, and then start the analysis. The user can then investigate and share the resulting report and, depending on the workflow, run further downstream analysis. Itwas shown recently that clinical data collectionwithrapidsequencingofpatientsisavaluabletoolto investigate suspected health-care associatedCOVID-19 cases15. Similarly, the Shivom platformcan enable quick insight into local outbreaks toidentify opportunities to target infection-controlinterventions to further reduce healthcareassociated infections. If widely adopted byresearchers and doctors around the globe, thiswealthofgenomicandhealthcare information isapowerfultooltostudy,predict,diagnoseandguidethe treatment of COVID-19. The information fromcombined epidemiological and genomicinvestigations can be fed back to the clinical,infection control, andhospitalmanagement teamsto trigger further investigations and measures tocontrolthepandemic.AccessManagementSigning up on the platform is easy and doesn’trequire complicated access procedures. The ideabehind this mechanism is to provide access toclinical and health data and analytics tools foreveryone. Calls for data sharinghavemostly beenrestricted topublicly fundedresearch,butweand

others argue that thedistinctionbetweenpubliclyfundedandindustry-fundedresearchisanartificialand irrelevant one16. At the center of precisionmedicine should be the patient which shouldoverride commercial interests and regulatoryhurdles. Data sharing would lead to tremendousbenefits for patients, progress in science, andrational use of healthcare resources based onevidence.Assuch, theShivomplatformstandsoutforitseaseofusefromtheveryfirststeps;theonlyrequirements are a valid email, preferablyinstitutional,incasetheuserwantstouseadvancedfeaturesthatcomewithinstitutionalplans.Theuserneedstosetapassword,andifrequired,canutilizeanadditional2stepverificationproceduretoaddanextralayerofsecurity.The personal data stored by the platform isprotected under GDPR regulations whether it ishostedintheEuropeanUnionoranothercountryonother continents. After the sign-up, users aredirectedtoasettingsandprofilepagethatallowstoadd additional information about the user andorganization,researchinterestsanddatasetneeds,ortoswitchbetweenlightanddarkmodetomakethe user experience more personalized. Thepersonalprofilecanbechangedatanytime.

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 6: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

6

Dependingontheaccounttype(e.g.academicgrouporcompany),theusercaninviteteammemberstotheworkspacetoworkonprojectstogetherandtoshare data and bioinformatics pipelines. Teammemberscanbeeasilyassigneduserroles,i.e.anynewusercanbeupgradedtoadministratorfortheaccount.

Figure 2: Team members from the same organization can be invited to join the workspace.

Teammemberscanberemovedanytimefromtheaccount by the administrator (project lead ormanager).Withthe‘ManagePlan’tab,theplatformcanbeadaptedandscaledtomeettherequirementsoftheusers,fromsingle/privateusers(e.g.studentsor freelance bioinformaticians), academia andstartups,tosmallandmediumenterpriseswithvastamountsofdataanduniquerequirementsaswellaspharmaceutical organizations looking for customsolutions.DataManagementThe Shivom platform was designed to make theprocess of uploading and sharing COVID-19 andother genetic data as easy and simple aspossible.The reason is that access to the data sharing andanalysis featuresshouldnotberestricted toa fewtrainedexperts(dataanalystsorbioinformaticians)butshouldbeeasyforallhealthcareprofessionals.Importing records (genetic ormetadata files) intotheuser’sworkspace isdonevia the ‘MyDatasets’page. In case of sequencing data, the user willprimarily choose to upload Variant Call Format(VCF) files.VCF is a tabdelimited text file format,oftenstoredinacompressedmanner,thatcontainsmeta-information lines, a header, and then datalineseachcontaining informationaboutaposition

inthegenome17.Theformatalsohastheabilitytocontaingenotype informationonsamples foreachpositioninthegenome(virusorhost).

VCFisthepreferredformatontheplatformbecauseit is unambiguous, scalable and flexible, allowingextrainformationtobeaddedtotheinfofield.ManymillionsofvariantscanbestoredinasingleVCFfile,andmostbioinformaticspipelineson theplatformuseVCF filesas input.MostNGSpipelinesuse theFASTQ file format, the most widely used format,generally delivered from a sequencer. The FASTQformat is a text-based format for storing both anucleotidesequenceand its correspondingqualityscores.FASTQ.gzisthepreferredfileformatforthestandard COVID-19 pipelines preinstalled on theShivomplatform.Despite VCF being very popular especially in thestatistical genetic area, there are also other fileformatsthatcanbeusedontheuploadordirectlyatthe point of configuring pipeline parameters. Thepurpose of these other genetic file formats is toreduce file size – from many VCFs to just threedistinctfiles:a).bed(Plinkbinarybiallelicgenotypetable), that contains the genotype call at biallelicvariants, b) .bim (Plink extendedMAP file)whichalsoincludesthenamesofthealleles:(chromosome,SNP, cM, base-position, allele 1, allele 2); c) .fam(Plinksampleinformationfile),isthelastofthefilesand contains all the details with regards to theindividualsincluding,whetherthereareparentsinthe datasets, and the sex (male/female). This filewill also contain information on thephenotype/traitsoftheindividualsinthecohort.Ideally, the sequencing file is accompanied byphenotypic metadata (e.g. is it patient or controldata,age,height,comorbidities,etc..),describingtheuploadeddatasets(seeMetadata).ThemetadatafileisuploadedasaCSVfile,acomma-separatedvaluesfile, which allows data to be saved in a tabularformat. CSV files can be used with most anyspreadsheet program, such as Microsoft Excel orGoogle Spreadsheets. For ease-of-use, a sampletemplatefileisprovidedontheplatform.DataMarketplacePermissionSettingsIn the next step, the user has the option to savespecificdatasharingpermissionsettingsrelatedtotheuploadeddatasettoallowforfine-graineddatasharing with others. These unique permissionsettingstransformthedataanalyticsplatformintoamassive data marketplace. Providing healthinformationforadatamarketplaceandsharingitisa key in leveraging the full potential of data inprecisionmedicineandthefightagainstpandemics.

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 7: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

7

Themarketplace isacentralpointofentryfortheusersinordertofindneededdatafasterandmorestructured. In addition, interoperability can becreated through the marketplace, which supportscross-sectoral applications. Unrestricted access torelevant pandemic data in particular will becomeincreasinglyimportantinthefuture.Theuserhasthreeoptions:Open–Alldatadesignatedopenwillbeavailablefor thepublic to analyze.However, opendoesnotmeanthatthedataisdownloadable,thispermissionsettingonlyallowsonlyforalgorithms/pipelinestousethedatainthedataanalyticsmodule.Allopendataissearchableontheplatform.Beingadata-hubfor COVID-19 data, we highly encourage allresearchersanddatacustodianstomaketheirdataopen/public and accessible for the scientificcommunity,therebyincreasingtheamountofopen-source COVID-19 data currently available andusable.Private– Privatedatawill be stored in a secure,private environment, and is only available andanalyzable by the data owner and teammembersfromthesameorganizationorprojectandwillnotshow up in data search results. This setting isusuallyusedwhendataisnotsupposedtobeshared(oronlywithinaconsortium),orifthedataneedstobe analyzed or published before sharingcommences.Monetizable – By choosing this option, the dataowner/custodian authorizes the marketplace tolicense their informationon theirbehalf followingdefinedtermsandconditions. In thiscontext,dataconsumers can play a dual role by providing databacktothemarketplace.Datamarketplacesmakeitpossible to share and monetize different types of

information to create incremental value. Bycombining information and analytical models andstructurestogenerateincentivesfordatasuppliers,moreparticipantswilldeliverdatatotheplatformandtheprecisionmedicineecosystem.Tomonetizepandemic data, making datasets available forresearchersatacost,canmakesenseifalaboratoryisnototherwiseabletofinancethedatacollection,e.g. when located in an underfunded developingnation.Atanytime,thepermissionsettingsofdatafilescanbeupdated.Thesettingshaveanaudittrailandtheplatformownerorserverhostarenotabletochangethesettings,onlythedataowner/custodianisabletomodifythesettings.In the next step during data import is to addadditionalinformationtothedataset,suchadatasetname,aDigitalObjectIdentifier(DOI),healthstatus,Author or Company name, a description of thedataset, or the name of the genotyping orsequencing system used (a list of sequencingplatformsisavailableinadropdownmenu).Otherfeatureswillbeaddedinlaterversions.After requesting the user’s digital signature toconfirmtheoperationscarriedoutso far, thedatawillbeuploaded,andtheuserwillberedirectedtoapagecontainingasummaryofhis/herdatasets.During the upload, the data is checked for qualitymetrics and for duplicates (compared to dataalready existing on the data marketplace) byinternal algorithms that prevent monetization ofdatasetsthatweresharedbyothersbefore.Geneticdataislargelytransformedintoacommonformataspartofthedatacurationprocess.Theplatformalsostarts automated proprietary anti-fraudmechanisms to identify data that was tamperedwith.

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 8: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

8

Figure 3: Top: Data Upload (dark mode). After choosing a file format from a drop-down menu, files can be either dragged and dropped to the workspace or uploaded via a new window. Bottom: Explicit data sharing permission are available to share data with selected collaborators, within a consortium, or with clients, co-workers, organizations or external analysts.

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 9: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

9

Figure 4: Dataset view that shows all data stored in the user’s workspace. The different permission settings are highlighted by designated icons and can be changed any time.

Alldatainauser’sworkspaceareeasilysearchableand can be sorted by therapeutic area (based oncollection of biomedical ontologies) and otherparameters. Generally, the workspace allows forbuildinglargerdatasetsintime,andatanypointtoshare data whenever it is needed, or to revokeaccess. There are several potential use cases forusingtheplatformasstoragespacefordata.First of all, having a sharing infrastructure willenable healthcare professionals to quicklyaccumulate and share anonymized data with thepublic (e.g. sequencing data from COVID-19patients),whichisparticularlyimportantduringanearlydiseaseoutbreak.Second, the platform is an ideal storageinfrastructure for data custodians (e.g. biobanks,direct to consumer genetic companies, or patientsupportgroups)thatinherentlyhavelargecuratedpatientandvolunteerdatasets.Thedatacanthenbemonetized,butmoreimportantly,thedatasetscanbe made available to the wider research andscientific community to make medicalbreakthroughs. Since the basis of the platform isprecision medicine, at this point in time, datacustodiansareencouragedtostoretheirdataasVCFfiles;however, aplethoraofother file formatsareplannedtobeaddedtotheplatformintime.VCFareparticularly useful as they are one of the final file

formatsofmostgenomic technologiesacrossbothmicroarray chips and sequencing (genome VCF/gVCF).Theyarealsothemostcommonfile formatwhen it comes to secondary genetic analysis, butequally their smaller file size reduces the cost onstorage.Third, using the platform as a repository forresearch data during a peer-review publicationprocess. Researchers are increasingly encouraged,or even mandated, to make their research dataavailable, accessible, discoverable and usable. Forexample, the explicit permission settings in theplatformwouldallowresearcherstograntaccesstothe journal’s reviewers to theirdataandanalyses,prior to publication. This time-restricted datasharingcanbevaluablewhendatashouldnotyetbeavailabletothewiderpublic.Fourth, the platform is an ideal data sharinginfrastructure for research consortia. Groups thatare located in different countries can easily sharedataandresultandbuild larger, statisticallymorerelevantdatasets. Suchaprocedure isparticularlyvaluable for pre-competitive consortia, tosignificantlylowerthecoststoaccumulatevaluabledatasets, while the in-house research can be keptconfidential.

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 10: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

10

Patientprivacy&datasecurityThe Shivom platform was developed with acomprehensive security strategy to protect userdataprivacyandsecurity.PatientPrivacyThecentralpillaroftheShivomdata-hubispatientdata privacy and confidentiality. The duty ofconfidentiality is a medical cornerstone since theHippocraticOath,datingbacktothe5thcenturyBC.Ontheotherhand,therighttoprivacyisarelativelyrecentjuridicalconcept18.Theemergenceofgeneticdatabases, direct-to-consumer genetics, andelectronic medical record keeping has led toincreased interest in analyzing historical patientand control data for research purposes and drugdevelopment. Such research use of patient data,however,raisesconcernsaboutconfidentialityandinstitutionalliability19.Obviously,doctorsandotherhealthcare workers have to make sure that noprivate and protected information of a COVID-19patient, such as name, address, phone number,emailaddress,orbiometricidentifiersareexposedto3rd parties.As such, institutional reviewboardsand data marketplaces must balance patient datasecurity with a researcher’s ability to explorepotentially important clinical, environmental andsocioeconomicrelationships.Inthiscontext,theplatformwasdesignedfromthebeginningtobeGDPRcompliantandtoprovidefullanonymization of patient records (privacy bydesign). The advantage of this approach is thatpersonalorconfidentialdatathathasbeenrenderedanonymousinsuchawaythattheindividualisnotor no longer identifiable is no longer consideredpersonaldata.Assuch,bymaking it impossible toconnectpersonaldatatoanidentifiableperson,datacontrollers (i.e. the data custodian or patientwhocan upload and manage data) and processors(researchers) are permitted to use, process andpublishpersonalinformationinjustaboutanywaythat they choose. Of course, for data to be trulyanonymized, the anonymization must beirreversible. Anonymization will provideopportunities for data controllers to use personaldatainmoreinnovativewaysforthegreatergood.Full anonymization isachievedwith several steps.Firstofall,noprivate,identifiabledataofpatientsisstoredtogetherwithclinicaldata.Second,andmostimportantly, the platform provides only summarystatistics/analyticsonthedata.Thatmeanssharingof data only allows bioinformatics pipelines ormachine learningalgorithms to touch thedata;noraw data (e.g. DNA sequencing reads) can bedownloadedorviewedbythedatauser/processor.

All datamust be stripped of any information thatcould lead to identification, either directly orindirectly—for instance, by combining data withother available information, such as genealogydatabases20. However, at this point in time, theconsent mechanisms as well as the process ofremovingprivateorconfidential information fromraw data remains with the data custodian, forexample the hospital that uploads the COVID-19-relatedsequences.Inthiscontext,itisimportanttounderstandthatonlythedataowner/custodianhasalltheaccessrightsandcanmanagethedatasuchasgrantingpermissionstousethedata.Theplatformproviderhasnopowertomanagethedataandonlyhasthepossibilitytoidentifyduplicatedatasetsthatwill automatically be excluded from the datamarketplace.Othersecurityfeaturesontheplatformarethatnostatistical analyses are possible on singleindividuals, and only vetted algorithms andbioinformatics pipelines are available on theanalyticsmarketplace toavoid thatalgorithmsaredeployed to specifically read out raw data fromindividualsortocross-linkdatafromindividualstoother,de-identifyingdatabases.DataSecurity&EncryptionBy harnessing the power of cloud computing theplatform enables data analysts and healthcareprofessionals,withlimitedornocomputationalanddataanalysistraining,toanalyzeandmakesenseofgenomicdataandCOVID-19inthefastestandmostinexpensivewaypossible.Thegeneticdata storedwith the platform is handled and managed withstate-of-the-art technology and processes, using acombination of security layers that keep data andmanagement information (cryptographic keys,permission settings etc.) safe. Currently, all theCOVID-19relateddataisuploadedtoAmazonWebServices(AWS)S3bucketsthatprovidedatastoragewith high durability and availability. Like otherservices, S3 denies access from most sources bydefault.The Shivom platform requires specific permissionto be allowed to perform transactions orcomputationactionsonthebucket.TheS3storageserviceallowstodevoteabuckettoeachindividualapplication,grouporevenanindividualuserdoingwork. Access to the buckets is regulated via theplatform on different, non-AWS server(s), and allinformationrelatedtotheaccessisencryptedusingstate-of-the-artKeyManagementServices(KMS).Inaddition,nouser/custodiandataisstoredalongsidegenetic data and is completely separated andencryptedusingdifferentcloudstoragesolutions.

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 11: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

11

ImmutableAuditTrail&SmartContractsOnepowerfulfeatureofoursolutionsistheabilitytomaintainanimmutabledata-accessaudittrailviablockchain.Thisimmutabilitypropertyiscrucialinscenarioswhere auditability is desired, such as inmaintaining access logs for sensitive healthcare,genetic andbiometric data. In precisionmedicine,accessaudittrailsforlogsprovideguaranteesthatasmultiple parties (e.g., researchers, laboratories,hospitals, insurance companies, patients and datacustodians) access this data create an immutablerecord for all queries21–23. Since blockchains aredistributed,thismeansthatnoindividualpartycanchange ormanipulate logs24,25. Such a feature canhelp journal reviewers and research collaboratorsto protect themselves from inflation bias, alsoknownas“p-hacking”or“selectivereporting”.Suchbias can be the cause of misreporting ormisinterpretation of true effect sizes in publishedstudiesandoccurswhenresearcherstryoutseveralstatisticalanalysesand/orswitchbetweenvariousdata sets and eligibility specifications, and thenselectively report only those that producesignificantresults.All permission settings stored in the system arecodedintoablockchain,guaranteeingthatnobodycantemperwiththepermissionsettingprovidedbythedataowner.Eachdatatransferontheplatformrepresents a cryptographic transaction hash.Hashing means taking an input string from theplatformofanylengthandgivingoutanoutputofafixedlength.Inthecontextofblockchaintechnology,thetransactionsaretakenasinputandrunthroughahashingalgorithm(e.g.ahashfunctionbelongingtotheKeccakfamily,thesamefamilywhichtheSHA-3hashfunctionsbelongto)whichgivesanoutputofafixedlengththatisstoredimmutableonthechain.Atransactionisalwayscryptographicallysignedbythesystem.Thismakesitstraightforwardtoguardaccesstospecificmodificationsofthedatabase.Transactions such asmonetization of datasets areencoded as smart contracts on the blockchain. A“smart contract” is simply a piece of code that isrunning on a distributed ledger, currentlyEthereum.It’scalleda“contract”becausecodethatrunsontheblockchaincancontroldigitalassetsandinstructions. Practically speaking, each contract isan immutable computer program that runsdeterministicallyinthecontextofaVirtualMachineaspartoftheEthereumnetworkprotocol—i.e.,onthedecentralizedEthereumworldcomputer.Oncedeployed, the code of a smart contract cannotchange. Unlike with traditional software, the only

waytomodifyasmartcontractistodeployanewinstance,makingitimmutable.AllthecontractsusedbytheShivomplatformonlyruniftheyarecalledbyatransaction,e.g.whenafileis used in a data analytics pipeline. In this case, asmart contract is triggered that can, if the filepermissionsaresettomonetizable,transferafeetotheelectronicwalletofthedataowner.Theoutcomeoftheexecutionofthesmartcontractisthesameforeveryone who runs it, given the context of thetransactionthatinitiateditsexecutionandthestateoftheblockchainatthemomentofexecution.If needed, smart contracts can be put in place toaccommodate specific terms needed in morecomplexresearchordrugdevelopmentagreements,e.g.ifdatasetsbelongtoaconsortiumorabiobank,oriftherearedatalockconstraints(e.g.anembargoon data release) and usual access procedures arecomplicated.Forexample,mostbiobankscontractsfor accessing data are long, arduous, detaileddocuments with complicated language. Theyrequire lawyersandconsultants to framethem, todecipherthem,andifneedbe,todefendthem.Thoseprocesses can bemademuch simpler by applyingself-executing,self-enforcingcontracts,providingatremendous opportunity for use in any researchfieldthatreliesondatatodrivetransactions.Thesecontractsalreadypossessmultipleadvantagesovertraditional arrangements, including accuracy andtransparency.Thetermsandconditionsofthesecontractsarefullyvisibleandaccessibletoallrelevantparties.Thereisno way to dispute them once the contract isestablished,which facilitates total transparencyofthe transaction to all concerned parties. Otheradvantages are speed (execution in almost real-time),security(thehighestlevelofdataencryptionis used), efficiency, and foremost trust, as thetransparent,autonomous,andsecurenatureoftheagreementremovesanypossibilityofmanipulation,bias,orerror.MetadataAdvances in data capture, diagnostics and sensortechnologies are expanding the scope ofpersonalizedmedicinebeyondgenotypes,providingnewopportunities fordevelopingricherandmoredynamic multi-scale models of individual health.Collecting, and sharing phenotypic andsocioeconomic data from patients and healthycontrols, in combination with their genotype andotheromicsprofiles,presentnewopportunitiesformapping and exploring the critical yet poorlycharacterized "phenome" and "envirome"dimensionsofpersonalizedmedicine.Therefore,in

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 12: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

12

the context of COVID-19, if available, metadatashouldbesharednexttosequencinginformationofthepatients.Suchmetadatacancontaininformationwith regards to the comorbidities, ethnicbackground, treatment and support that theyreceive rather than biological or clinical data.Currently,metadataisuploadedasaCSVfiletotheplatform,whichallowsalmostallphenotypicdatatobesaved ina tabular format.Other inputsmaybeavailable in the future, e.g. to parse informationdirectly fromamedical recordor fromwearables.Using a simple tabular format also allows it to besource agnostic, e.g. it is not difficult to combinegeneticinformationwithpatientquestionnaires.Indeed, patients support groups, direct-to-consumer genetic companies or patient registriesarelesslikelytohavebiologicaldatatypesintheirrepositories and often have only metadata thatcomes from questionnaires that volunteers havefilled.AnexampleofthisisTheDementiaPlatformUK (DPUK)26, a patient registry that comprisesthousands of individuals from across dozensindependent cohorts - which allows a variety ofanalysistobeconductedincludingmachinelearningand mendelian randomization techniques. Theplatformismulti-purposebuilt,whichallowsforalltypesofdatathatisalpha-numericalencodedtobestored.Otherplatformsareengineeredspecificallyforaparticulartypeofcohort,e.g.tailoredtowardsanalyzing biobank data. The Shivom platform is

completely source agnostic and is engineered toallow large cohort datasets to be uploadedregardlessofbiobanksorcohorts.Additionally,theplatformwillallowforbiostatistical (non-genetic),epidemiological and machine learning analysistypestobeundertaken.Themetadataisstructuredinasimpleandintuitivewayinordertoallowtheusertobeabletoquicklyquerythedataset.Dependingonwhetherthedatabeing uploaded has accompanying geneticsinformation, the first columnof the filewill eithercontainthenameoftheVCFfileorthefirstmetadataattribute.CSVfilesaresupportedbynearlyalldataupload interfacesandtheyareeasier tomanage ifdata owners are planning to move data betweenplatforms,exportandimportitfromoneinterfacetoanother. A template that illustrates howmetadatacanbeorganized,includingastandardmetadataset(e.g. height, smoking status, gender, etc..) isprovidedontheplatform.For COVID-19 datasets (provided as VCF) weencourage data providers to share as many aspossible phenotypic, clinical and socioeconomicdata, keeping in mind that all data should beanonymized. Working with several COVID-19consortia, we established a guideline of what anidealdataset should include, keeping inmind thatmany clinical data are usually not available 9seesupplementformetadatastructure):• Ethnicity,Age,Gender,Height,Weight

Figure 5: Discovery page with search results. On the right pane, the list of datasets can be further filtered in a fine-grained manner according to phenotypic or socioeconomic data contained in the associated metadata files, allowing for fine-tuning of the curated dataset.

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 13: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

13

• Smokinginformation• Comorbidities (i.e. Diabetes, cardiovasculardiseases, renal insufficiency, COPD, Malignanttumor,autoimmunediseases,immunodeficiency,asthma,cystic fibrosis,Dementia,Liverdisease,Hemiplegia,Leukemia,Lymphoma,AIDS…)

• Charlsoncomorbidityindex27• Pneumonia status: non-present, present:unilateral,present:bilateral(dual)

• Sequential Organ Failure Assessment (SOFA)score28

• ClinicalPulmonaryInfectionScore(CPIS)29• Quick sepsis related organ failure assessment(qSOFA)30

• Clinical Outcome: oxygen, continuous positiveairway pressure (CPAP), intubation data, dayswithsymptoms,antiviralatdiagnosis,mortality,andclinicallaboratorydata.

DataSearchTheplatformsupports fine-grainedsearchqueriesacrossglobalomicsdatasets.Theideaistofindandanalyze genetic and other omics dataset fromvarious global sources in the easiest, fastest, andmostefficientway.Onthe‘Discover’page,withjustafewsimplesteps,theuserwillbeabletoquicklydiscoverandselectthedataaccordingtothediseaseof interest and filter the dataset based on theappropriate metadata for their study or researchquestion.Onestrengthoftheplatformisthattheusercanfindand combine open-source, in-house, andproprietary datasets in a matter of seconds. Theuserhas theopportunity toperforma free searchwithadvancedsearchoperators(e.g.searchingforgene-ordisease-name),orcanuseseparateclinicalsearchbars,andtherapeuticarea,diseaseordatasettitle.TheDiscoverwindowwill thenbepopulatedwiththesearchresults.Currently,thesearchqueryform supports DOID ontology, a comprehensivehierarchical controlled vocabulary for humandisease representation. but other standards (EFO,ICD10)areplannedtobeaddedinthefutureaswell.StratifyingPatientsOncethedatahasbeenselected,itwillbepossibleto further filter it based on the informationcontainedintheassociatedmetadatafiles,generallygender,diet,country,etc.,andalsothetypeofdatapermissionassociated.Forexample,itispossibletoselectonlyfreedatathatisnotassociatedwithanycosts. This fine-grained filtering feature thereforeallowstostratifypatientsbasedontheinformationtheuserisinterestedin,sothatitwillbepossibleto

selectthemostsignificantdatafortheanalyzestobeconducted. For example, once the database ispopulated,itshouldbepossibletoselectCOVID-19patients by ethnicity or smoking status to quicklyfind confounding factors that may affect theoutcomeandseverityofthedisease.Once the user has completed the selection of thedatathathastogointothedataanalysis,asimpleclickonthe‘ChoosePipeline’boxwillbringtheuserto thedata analyticsmarketplace for downstreamanalytics.DataAnalyticsMarketplaceAfter selecting datasets for analysis, the user isdirected to the data analytics marketplace. Themarketplace is designed to provide researcherswithorwithoutbioinformaticsexpertisetoperformawidevarietyofdataanalysesandtosharein-housepipelinesaswellasmachinelearningtoolswiththescientific community. Since the launch of theplatform, a variety of standard bioinformaticspipelines/workflows, primarily for secondaryanalyses, were already added to the marketplacethatare free touse, includingpipelines toanalyzeCOVID-19patients.TheShivomanalyticalplatformhosts a variety of genomic-based pipelines; fromraw assembly and variant calling, to statisticalassociation-based analysis for population-basedstudies.Intime,avastvarietyofnewdataanalyticstoolsandpipelineswillbeaddedtothemarketplace,includingAIanalyticsfeatures.Currently,allprovidedstandardpipelinesarebuilton Nextflow technology31. Nextflow is a domain-specific language that enables rapid pipelinedevelopment through the adaptation of existingpipelines written in any scripting language32.Nextflow technology was chosen for the platformbecauseitisincreasinglybecomingthestandardforpipeline development. Its multi-scalecontainerizationmakesitpossibletobundleentirepipelines,subcomponentsandindividualtoolsintotheir own containers, which is essential fornumerical stability. Nextflow can use any datastructureandoutputsarenotlimitedtofilesbutcanalso include in-memory values and data objects32.Another key specification of Nextflow is itsintegration with software repositories (includingGitHub and BitBucket) and its native support forvarious cloud systems which provides rapidcomputationandeffectivescaling.Nextflow is designed specifically forbioinformaticians familiar with programming.However,itwasouraimtoprovideaplatformthatcan be used by inexperienced users and withoutcommon line input. As such, the Shivom data

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 14: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

14

analyticsmarketplacewasdesigned to provide aneasytousegraphicaluserinterface(GUI)whereallpipelinescanbeeasilyselectedandconfigured,sothatnocodingexperienceisrequiredbytheuser.Inaddition to thedatamarketplace, this feature setstheplatformapartfromothercloudcomputingtoolsthatuseNextfloworotherworkflowmanagementtools suchasToil, SnakemakeorBpipe.Users caneasily automate and standardize analyses (e.g.GWAS, Meta-GWAS, Genetic Correlation, TwoSample Mendelian Randomization, Polygenic RiskScore,PheWAS,Colocalizationetc..)byselectingallparameters for sets of different analysis activities.This makes it possible to run analyses fullyautomated,meaningthatevennewuserscanstartusingsophisticatedanalysesbyrunningaworkflowin which all analysis steps are incorporated, i.e.assembledbymoreexperiencedbioinformaticiansormachinelearningexperts.

Figure 6: The data marketplace has a simple multiple-choice process: first, the user can choose which pipeline to use from a large variety of bioinformatics and AI pipelines, and can then define the parameters of the same. Findings that cannot be reproduced pose risks topharmaceutical R&D, causing both delays andsignificantlyhighercostsofdrugdevelopment,andalso jeopardizethepublic trust inscience(36).Byusingstandardpipelinesandallowfor3rdpartiestoshare their workflows on the platform, and bycreatingreproducibleandshareablepipelinesusingthepublicNextflowworkflowframework,notonlydo bioinformaticians have to spend less timereinventingthewheelbutalsodowegetclosertothegoalofmakingsciencereproducible.GWASpipelinetostudyvirus-hostinteractionsThe disadvantage of many research consortia,includingsomeCOVID-19consortia,isthattheydonot open the data to all consortium members toanalyze the collected data individually, but onlyprovidedataanalysisbyacentralcoordinator.Then

thesummarystatisticsoftheanalyzedcohortsarefedbacktotheconsortiummembers.Forexample,the Host Genetics Initiative agreed on astandardized Genome-wide association studies(GWAS) pipeline33. The Shivom platform worksdifferently, and all participants of a researchconsortium(andother3rdpartiesifpermitted)cananalyzethedatasetsindependently.TomakeGWASanalyses available to the whole researchcommunity,oneofthemostimportantpreinstalledpipelines on the platform is a standard GWASanalysisworkflowfordataqualitycontrol(QC)andbasicassociationtesting.Genome-wideassociationstudies are used to identify associations betweensingle nucleotide polymorphisms (SNPs) andphenotypictraits34.GWASaimstoidentifySNPsofwhichtheallelefrequenciesvarysystematicallyasafunctionofphenotypictraitvalues.Identificationoftrait-associatedSNPsmayrevealnewinsightsintothe biological mechanisms underlying thesephenotypes. GWAS are of particular interest toresearchersthatstudyvirus-hostinteractions.The SARS-CoV-2 virusdoesnot affect everyone inthe same way. Some groups seem particularlyvulnerabletosevereCOVID-19,notablythosewithexisting health conditions or with geneticpredispositions. Variation in susceptibility toinfectiousdiseaseanditsconsequencesdependsonenvironmentalfactors,butalsogeneticdifferences,highlightingtheneedforCOVID-19hostgeneticstoengagewithquestionsrelatedtotheroleofgeneticsusceptibility factors in creating potentialinequalities in theability toworkoraccesspublicspace, stigma, and inequalities in the quality andscopeofdata35.COVID-19pipelinesThe genome sequencing of the coronavirus canprovidecluesregardinghowthevirushasevolvedand variants within the genome. The highlightedplatformcontainsasetofbioinformaticspipelinesthat allow the interrogation of host and virusgenomes. Genetic information might enabletargeting therapeutic interventions to thosemorelikely todevelopsevere illnessorprotecting themfrom adverse reactions. In addition, informationfromthoselesssusceptibletoinfectionwithSARS-CoV-2 may be valuable in identifying potentialtherapies.The current version of the platform comes withpreinstalled COVID-19 specific pipelines coveringassembly statistics, alignment statistics, virusvariant calling, and metagenomics analysis, andotherpipelinesthatcanbeusedforanalyzingCovid-19 patient data such as GWAS orMetaGWAS. The

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 15: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

15

research community is encouraged to add morepipelinesandmachinelearningtoolstosupportthefight against the pandemic. All the standardpreinstalledCOVID-19pipelinesworkwithFASTQsequencingfiles.TheplatformworkswithIlluminashortreadsaswellasOxfordNanoporelongreads.Longandshortreadsdifferaccordingtothelengthof reads produced and the underlying technologyused for sequencing. The short-read length (100-200 bp) limits its capability to resolve complexregionswithrepetitiveorheterozygoussequences,so the longer read lengths (up to kbs) havefundamentally more information. However, thesetendtosufferfromhighererrorratesthantheshortreads.Theshortreadscomeinzippedformat(.gz)anddonot need to be extractedbefore use; sincethere are paired-end, each sample will have twofiles:R1andR2FASTQthatneedtobeincludedinananalysis.LongreadscomeassingleFASTQfiles.Each analysis requires a sample metadata sheet,which specifies the FASTQ.gz files for each of thesample and other associated metadata files. Atemplate metadata file can be accessed on theplatform, in which the user can also find thedescriptionofallcolumnsandtheexamplevalues.MetagenomicAnalysisOne of the common analyses done on patientsequencesistoidentifythepathogenscontainedintheanalyzedsampleandtoassigntaxonomiclabels,usually obtained through meta-genomic studies36.Metagenomics allows researchers to create a

pictureofapatient’spathogenswithouttheneedtoisolate and culture individual organisms. ThepredictionpoweronCOVID-19alsodependsondatarelated to underlyingmorbidities, including otherflu-like illnesses. Thoughmetagenomic testing forothervirusescanthereforeincreasethespecificityoftheGWASstudiesandotheralgorithms.The Shivomplatform comeswith ametagenomicsanalysispipeline that isbasedon thenf-core/viprpipeline, a bioinformatics best-practice analysispipeline forassemblyandintrahost low-frequencyvariant calling forpathogen samples.Thepipelineuses the Kraken program metagenomicsclassification37–39.Usingexactalignmentofk-mers,the pipeline achieves classification significantlyquickertothefastestBLASTprogram.Theprocessofrunningpipelinesontheplatformarelargely the same and do not require in-depthbioinformatics expertise or command line entry.The user is guided through all steps. After theupload of the selected FASTQ files and associatedmetadata, the user can choose to add a qualitycontrol and then tomodify the parameters of thepipeline,ifrequired.Afterstarting thecomputation, theusercaneasilymonitor theprogressof thepipeline andall othertasksinthe‘Projects’tab(seeFig.8).

Figure 7: The Shivom COVID-19 Metagenomics Pipeline. The pipeline takes FASTQ (short- and long reads) together with metadata and will provide a comprehensive analytics report in PDF format and as interactive on-screen report.

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 16: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

16

Figure 8: Top: Data & Parameters. After selecting the input files, the user can select and change the pipeline parameters, or can choose the default parameters recommended. The focus of the platform is on ease-of-use, no coding expertise is required. Bottom: The Projects view allows easy monitoring of all pipelines in progress and completed.

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 17: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

17

By pressing on a project, the user can see moredetails about the status of the computation andparameters used. If required, it is possible todownload a log file and to monitor the costs ofcomputations.In case a pipeline fails, e.g. in the case there areproblemswith therawdataormetadatasheet,allprocessesthepipelinewentthrougharemonitoredin real time with additional information on the

computation’sCPUloadandmemory.Inthiswayitiseasytofindthestepwhereapipelinefailed,andtheerrorcanbeaddressed.After the pipelines are finished, the user canmonitortheresultsbyclickingthe‘Results’button.All the pipeline-specific plots are presented in aninteractiveviewandcanbeexportedtoaPDFreportthatcontainsalltheparametersandsettingsoftheanalysisforeasyresultssharing.

Figure 9 Top: The Pipeline view allows the user to monitor running and completed processes. Bottom: A list of all pipeline processes allows the user to monitor progress and helps in identifying potential errors.

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 18: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

18

Figure 10: A: Sequence Counts of the patient samples. This plot will give a picture of how much data was generated by the sequencing machine for each sample in terms of number of reads. B: This plot shows the GC content across the whole length of each sequence. The GC content of the plot indicates the average GC percentage within each sample. The distribution should ideally show normal distribution. Any skew in the curve indicates either contamination or some problem with library preparation or PCR steps in the experiment. C: Alignment Stats. There are two levels of alignment, first, alignment to the human reference genome, and second the unmatched reads aligned to the coronavirus reference genome. The mapped sequences are taken for variant calling. D: Species Profile. The plot displays the pathogens identified in the patient sample (e.g. saliva swab); in this case both individuals sequenced have SARS-CoV2 virus sequences identified. Byinterrogatingthespeciesprofile,thepresenceofSARS-CoV2 can be confirmed and the presence ofother pathogens that may add to observedphenotypescanbeanalyzed.Theotherpreinstalledpipelinesalsoworkwithpatientsequencingfiles:COVID-19AssemblerThis is a bioinformatics pipeline for SARS-CoV2sequence assembly, by aligning and mergingfragments from a longer viral DNA sequence inordertoreconstructtheoriginalsequence.Therawdata, represented by FASTQ files, a mixture ofhumanandpathogensequences,aretakenasinputand thenchecked forquality.After removalof theadapter sequences, the reads are mapped to thehuman reference genome (hg19;GRCh37.75). Theunaligned, viral sequences are then taken for denovo assembly using the Spades program andevaluatedusingtheQuastprogram.Theassembledgenome files are viewable in programs such asBandage. The contigs are subjected to gene/ORFpredictionandtheresultingsequencesare furtherannotatedusingPROKKA.COVID-19AlignmentThis pipeline produces variant files and anannotation table from the SARS-CoV2 virusespresentintheCOVID-19patient.Unlikegoingforade novo assembly, the user can directly align thereads to the availablepathogen reference genomeand identify the variants presented within theanalyzedsamples.Theresultingvariantcallscanbe

used in downstream analyses to identify cluesregarding how the virus has evolved and thedistributionofvariantswithinthevirusstrain.Therawdata(FASTQshortreads)fromthepatientaretaken as input and are checked for quality withFastqQC. The reads are thenmapped to the hg19humanreferencegenome.Theunalignedreadsarethen taken for alignment with the coronavirus(NC_05512)referencegenome.Thealignmentsarechecked forduplicatesandrealignedusingPicard.Thevariantscallingtools(lofreq)areusedontheserealigned files toget the listof thevariants in thepatient’s viruses. The variants are annotatedwithsnpEff.COVID-19VariantCaller(Patient)This pipeline is designed to produce VCF variantfilesfromtheCOVID-19patientssequencingreads.TheresultingVCFfilescanbevaluableinidentifyingthe variants associated with the coronavirusinfection and can be easily exported and used fordownstream analyses. The pipeline performs aqualitycheckonthereadsandtrimstheadapters.Thesetrimmedreadsarethenalignedtothehumanreference genome (hg19). The alignment files arechecked for duplicates, realigned and variants arecalled. These variants are annotated using thesnpEFFprogram.Overall, thewholeworkflow, from data upload torunningaCOVID-19relatedpipelinewasdesignedto be as easy as possible, with only a few stepsinvolvedandnocommandlineentryneeded,sothat

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 19: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

19

researchers and healthcare professional can runanalyses of patients sequences and share the filesand results with the research community.ResearcherswithotherNextflowpipelines relatedto tackle the COVID-19 pandemic are highlyencouragedtosharethemontheShivomplatform.

ResultsandDiscussionThemaingoalofthisprecision-medicineplatformistoprovide theglobalresearchcommunitywithanonline marketplace for RAPID data sharing,managing&analytics capabilities, guidelines,pilotdata, and deliver AI insights that can betterunderstandcomplexdiseases,aging&longevityaswell as global pandemics at the molecular andepidemiologicallevel.Amongotheruse-cases, theprovidedplatformcanbe used to rapidly study SARS-CoV-2, includinganalysesofthehostresponsetoCOVID-19disease,establish a multi-institutional collaborative data-hub for rapid response for current and futurepandemics, characterizing potential co-infections,and identifying potential therapeutic targets forpreclinical and clinical development. No patientdatacanbetracedbacktothedatadonor(patientordatacustodian)becauseonlyalgorithmscanaccessthe raw data. Researchers querying the data-hubhavenoaccesstothegeneticdatasource,i.e.norawdata(e.g.nosequencingreadscanbedownloaded).Instead, researchers and data analysts can runpredefinedbioinformaticsandAIalgorithmsonthedata. Only summary statistics are provided (e.g.GWAS,PHEWAS,polygenicriskscoreanalyses). Inaddition to genetics bioinformatics pipelines,specialized COVID-19 bioinformatics tools areavailable (e.g., assembly and mapping pipelines).Using these free tools, researcherscanquicklyseewhy some people develop severe illness whileothers do show only mild symptoms or nosymptomswhatsoever.Wemighthypothesizethatsome people harbor protective gene variants, ortheir gene regulation reacts differently when theCOVID-19virusattacksthehost.Particularly during an early outbreak, virus-hostgenetictestinghasvalueinidentifyingthosepeoplewhoareathighorlowriskofseriousconsequencesof coronavirus infection35. This information is ofvalueforthedevelopmentofnewtherapies,butalsofor considering how to stratify risk, and identifythosewhomightrequiremoreprotectionfromthevirus due to their genetic variants. Identifyingvariantsassociatedwithincreasedsusceptibilitytoinfection,orwithseriousrespiratoryeffects,relies

ontheavailabilityofgeneticdatafromtheaffectedindividuals. However, here COVID-19 researchencounters challenges associated with thepopulation distribution of genetic data, and theconsequentprivilegingofspecificgroupsingeneticanalyses35. Most data sets used for genome-wideassociation studies are skewed toward Caucasianpopulations, who account for nearly 80% ofindividuals in GWAS catalogs40. In comparison,those fromAfrican andAsian ancestry groups arepoorly represented. Even large-scale geneticanalyses may consequently fail to identifyinformative variants whose frequency differsamongpopulations,eitherunder-oroverestimatingriskinunderstudiedethnicsubgroups.Datafrommanycountriesshowedthatpeoplefromblackandminorityethnic(BAME)groupshavebeendisproportionately affected. People from a BAMEbackground make up about 13% of the UKpopulationbutaccountforathirdofviruspatientsadmitted to hospital critical care units. BlackAmericans represent around 14% of the USpopulationbut30%of thosewhohavecontractedthe virus. More than 70% of healthcareprofessionalswho have died in theUK have beenfromBAMEbackgrounds.Similarpatternsshowingdisproportionate numbers of BAME virus victimshave emerged in the US and other Europeancountrieswithlargeminoritypopulations.IntheUSandUK,theproportionofcriticallyillblackpatientsis double that of Asians, so geography andsocioeconomicfactorsaloneareunlikelytoexplainthestarkdifferences.Thekeytofightingearlydiseaseoutbreaksisspeed.Withtherealitythatsomeinfectionsmayoriginatefrom unknowing carriers, FAST patientstratification and identification is even moreparamount. It is important to investigate if theseverityofthediseasemaylieintheinteractionwiththe host genome and epigenome. Why do somepeoplegetseverelysickanddiewhileothersdonotshowanysymptoms?Itcouldbethatsomepeopleharbor protective gene variants, or their generegulationisabletodealwithvirusesattackingthehost.Veryquickdatasharingandanalysisensuresthatprecisionmedicine isbrought topatientsandhealthy individuals faster, cheaper, and withsignificantlylesssevereadverseeffects, leveraginginformation from the interaction between labs,biobanks,clinics,CROs,investigators,patientsandavarietyofotherstakeholders.Changingtheglobaldata-sharingparadigm

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 20: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

20

Asoutlinedinthispaper,wehavethetechnologytomake easy data sharing a reality. However, theobservationthatdatasharingtechnologiesarestillextremelyunderutilized andmostdata custodiansarereluctanttosharetheirdata,demonstratesthatthe global healthcare community needs anelemental paradigm shift, a major change in theconcepts and practices of how data is shared andutilized,particularlyduringapandemic.Aparadigmshift, as outlined by the American physicist andphilosopherThomasKuhn,requiresafundamentalchange in the basic concepts and experimentalpracticesof a scientificdiscipline.KuhnpresentedhisnotionofaparadigmshiftinhisinfluentialbookThe Structure of Scientific Revolutions. We arguethattheworldneedssucharevolutionaryparadigm.Researchersusereal-worlddatafromlaboratories,testing centers, hospitals, academic institutions,payors,wearables,andotherdatasourcessuchasdirect to consumer genetic companies to betterunderstand disease outbreaks and to developvaccines andmedicines quickly. Suchdata is usedacross the whole spectrum of drug discovery,vaccinedevelopment,andbasicacademicresearch.However,whoeverhaseverappliedfordataaccesstoamajorbiobankknowsthattypically,it’scostly,time-consuming and nerve wrecking, in the best-casescenario,toobtainsuchdata.Data is often received from a third-party partnerthataggregates,anonymizesit,andmakesitusablefor their research purposes. Often, there is asignificant lag time between when the data isgenerated and when it’s available for use. Forexample,whenaninfectedpatientissequencedinahospital,itcantakebetweenweeksandevenyearsuntil thedata surfaces and is available in apublicdatabase. But if healthcare systems are trying tocontrol a pathogen outbreak and want tounderstand more about the affected patients andhost-virusinteractions,thingsarechangingdaybyday,weneedthatinsightintothereal-timeaspectofwhat’sgoingon.What we need is an ecosystem that, for example,allows a doctor who sees a patient and sends abuccal swabout forsequencing, togetsequencingresults, sharing it with the global community andgets in-depth data analytics back to evaluate if asimilar host-virus interaction was observedanywhere else on the planet -within 24 hours orless.Inthecurrentpandemic,forexample,thereisaneed to understand COVID-19 patient data at thelocallevel,directlyfromthesources,togetdatainreal time. Omics data from patients, exposedindividuals, and healthy control subjectswill be acritical tool in all aspects of the response—from

helping to design clinical trials, identifyingvulnerable groups of the population, tounderstanding whether COVID-19 patients takingspecificmedicinesareathigherriskofdevelopingsevereside-effects.IncentivesforsharingdataTheabilitytocollect,analyze,andinterpretdataisfundamental to the management of infectiousdiseases. To do so, stakeholders need to beincentivizedtosharetheirdata.Sharingdataisgoodforscience,notonlybecausetransparencyenhancestrust in science, but because data can be reused,reanalyzed, repurposed, and mined for furtherinsight. Sharing data on the Shivom platform canincrease transparency. The more transparent thecontributorsareabouttheirdata,analyticmethods,oralgorithms,themoreconfidenttheyareoftheirwork,andmoreopentopublicscrutiny.Usingtheplatform can also help researchers and consortiathat want to collaborate within a safe space. Weneedmoredataaggregatorsthatarecommittedtosupportinganecosystemofresearchcommunities,businesses,andresearchpartners,bysharingdataoralgorithmsinsafeandresponsibleways.Suchanopenecosystemapproachcanyieldhighdividendsforsociety.Inaddition,consortiacancometogethertobuildpatientcohorts,maximizingtheirresearchbudget. Research organizationswho share similargoals can join efforts to create insight at scale,creating datasets that meet the needs of multiplestakeholdersatlowercosts.A successful response to the COVID-19 pandemicrequiresconvincinglargenumbersofscientistsandhealthcare organizations to change their datasharing behaviors. Although the majority ofcountries have population studies and buildmassivedigitalbiobanks,toourknowledge,noneofthelargerbiobanksintheworldhaveadatasharinginfrastructure to share data with other countries.The Shivom platform provides the only cross-border datamarketplace to allow data custodiansand biobanks to easily & securely share andmonetize anonymized datasets. A massive,decentralized,andcrowd-sourceddatacanreliablybe converted to life-saving knowledge ifaccompaniedbyexpertise,transparency,rigor,andcollaboration. Overall, there are several obviousadvantagesusingtheplatform:● Makingdataactionableasdatacanbeanalyzedonthebioinformaticsplatform

● Breakingdowndatasilos,puttingdatainalarger,globalcontext

● Makingdataeasilyfindable&searchable

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 21: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

21

● Abilitytoeasilysharedatawithfine-grainedpermissionwithclients,collaborators,orthepublic

● Improvedaccessibilitybyavoidingcomplicatedaccessprocedures,promotingmaximumuseofresearchdata

● Advancedinteroperabilitybyintegratingwithapplicationsorworkflowsforanalysis,storage,andprocessing;e.g.newAIalgorithms

● Betterreusabilitysothatdataisnotstoredandforgotten,butreusableinmanyresearchprojects

● Increasedimpactassharingdatawillhelptheglobalfightagainstcomplexandrarediseases

● Gainreputationfortheuser’sorganizationbygivingresearcherstheabilitytofind,useandcitetheirwork

DataMonetizationOptions.Usually,formostdatacustodianssuchasclinicallaboratories,dataclean-upproceduresanddatadepositionarenot formalresponsibilities.Thesejobshavetobedonebysomefairly busy people, in an increasingly demandingenvironment.Thereisnoremunerationforthelabsto submit their data; but done out of goodwill. Aslong as the data collection is funded by publicinstitutions this is no major problem, but privatehospitals,smallbiotechorgeneticcompanies,datasharing can become a challenge. The Shivomplatform offers the possibility to monetize theirdatasets. Practically, that means if a dataset ismonetizable,wheneverabioinformaticspipelineoralgorithm touches the dataset, a smart contract istriggered that sends a predefined fee to the datacustodian.Thepricingcanbeadjusteddependingonthequality,datatypeandotherfactors.While open data sharing is encouraged for thepublicly funded research community, suchmonetization options can be a valuable extraresource of funds for organizations that have toworry about sustainability and revenue models.Unlikemanyotherdatabrokers,organizationscanupload and monetize data directly without adistributor or middleman. This leads to real-timepayouts, higher revenue share rates, and instantuploadavailability.Itallowsresearcherstouploadtheirowndataandgetitinfrontofanaudienceofother scientists. In a way, it almost acts like acontent creation tool, rather than a pure datasharingservice.From the data-user perspective (e.g. a pharma orbiotech company), getting access to large geneticdatasetsgetsmucheasierandmoreaffordableasitis not necessary to acquire exclusive licenses thatcan costmillions of dollars. Instead, thedata user

onlyhastopayforaccessingthedatathatisactuallyused, similar to the way people access music onstreaming services. In addition, since the datacustodianremainsfullownershipofthedata,thereis no need for complicated business relationships,norwilltheirinternalbusinessmodelbechallenged.TacklingDataCurationDeptUsing the Shivom platform as the regular datarepository can help researchers with their datacuration tasks, preventingwhatwas termed ‘datacuration dept’41. Researchers can migrate/copytheir data to the platform to protect themselvesfrom breaking in-house legacy hardware, e.g. oldunmaintainedservers.Also,digitizingrawdataandstoring it increases security of the data as manydatasetsarestillstoredondecayingphysicalmedia(e.g. consents done on paper,metadata stored onCDs/DVDs,MiniDisc,memorysticks,externalharddrivesetc.).Ifthephysicalmediumdecays,thentherawdataarelost,makingitimpossibletoreproducethe original research or apply new techniques toanalyzetheoriginalrawdata.Lossofdatahasfar-reachingconsequences:arecentsurveydemonstratedthatthelackofavailabilityofrawdataanddataprovenancewerecommonfactorsdriving irreproducible research42. For manyacademiclabs,thelossofdatacouldleadtoalossoffundingbecausefuturegrantapplicationswouldnotbe able to list their data as a resource. Even fornormaldailytasksinatypicaluniversitylaboratory,if data is not lost but requires additionalwork tolocate, for example, after a PhD student left theresearch group, then an additional time and staffcost is involved to find the data and re-learningknowledge that was the domain of the departedteam member. Particularly in multinationalpandemiccohortstudieswheretherecruitmentanddatacollectionaresodifficult,itisunacceptablethatthereissucharisktothehugelyvaluableresourceof historical data through the accrual of datacurationdebt41.Joining forces and sharing information at thenationallevelalsoeasesandsupportsinternationalcooperation initiatives. National coordination ofpandemic policy responses can also benefit fromjoining forces with international researchcooperation platforms and initiatives. It isimportanttonotethatdatastoredandsharedontheShivomplatformdoes not need to be exclusive tothis data marketplace; the data can be hosted inotherdatabasesaswell.Thebenefitsofdatasharingdonotendhere,otherbenefitsthatshouldincentivizeresearcherstosharetheir datasets include the chance of getting more

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 22: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

22

citations,increaseofexposurethatmayleadtonewcollaborations,boostingthenumberofpublications,empowering replication, avoiding duplication ofexperiments, increasing public faith in science,guiding government policy, andmanymore. Also,farfrombeingmererehashesofolddatasets,moreand more evidence shows that studies based onanalyses of previously published data can achievejust as much impact as original projects andpublishedpapersbasedonshareddataare justaslikelytoappearinhigh-impactjournals,andarejustas well-cited, compared with papers presentingoriginaldata43,44.Buildingamassiveglobaldata-hubwithitswealthof multi-omic information is a powerful tool tostudy,predict,diagnoseandguidethetreatmentofcomplex disease and help us to live longer andhealthier.Also,datasharingshouldnotstopwithasuccessfulvaccinerollout.Healthcare systems need to monitor closely thecharacteristics of the most promising vaccinecandidates. Such monitoring includesunderstanding the patient's response, dosingregimen, potential efficacy, and side effects.Measures to monitor and control the COVID-19pandemic by the scientific community will berelevantforaslongasitsriskcontinues.RegulatoryandethicalconsiderationMoving genomics data and other information intoroutinehealthcaremanagementwillbecritical forintegratingprecisionmedicineintohealthsystems.However, data sharing is often hindered bycomplicated regulatory and ethics frameworks.Obviously, there are ethical and legal issues toconsider in any work that involves sharing ofsensitivepersonaldata.Nevertheless, it isobviousthatweneedtobeabletosharedatatofightglobalpandemics. The SARS-CoV2 virus does not careabout jurisdictionalboundaries.Data fromCOVID-19 patients enables experts to interpret complexpieces of biological evidence, leading to scientificandmedicaladvancesthatultimatelyimprovecarefor all patients. But these advantages must bebalanced against the duty to protect theconfidentialityofeachindividualpatient.Itisoutofthescopeofthisstudytodiscussethicsindepth. The Shivom platform is built to beoperational in all jurisdictions around the world.Manyregulatory frameworksdoalreadyexist thatallow to collect and hold anonymized data frompatientsintheinterestofimprovingpatientcareorin the public interest. It is now time to go a stepfurther and decouple procedures from regionalregulatory frameworks, tomakedatasharingeasy

acrossbordersandregulatory frameworks. In thiscontext,themagicwordisanonymization.Ontheoutlineddatasharingplatform,noaccesstorawdataispermitted.Onlyalgorithmsthatprovidesummary statistics are able to touch the data,guaranteeingthatnodatasetcanbetracedbacktoan individual. This process makes it possible toshare important multi-omics informationanonymously and securely, whilst still enablinglinkage to metadata such as diagnosis, lifestylefactors,treatmentandoutcomedata.Thekeyisthatdata on the platform is largely protected from re-identificationofindividuals.Re-identificationinthiscontext isunderstoodaseither identitydisclosureor attribute disclosure, that is, as either revealingthe identityofaperson, thusbreachinghisorheranonymity,ordisclosingpersonalinformation,suchas susceptibility to adisease,which is abreachofprivacy20. The Shivom platform was designed tomake it impossible to disclose information fromgeneticdatanotsolelyontheindividual,butalsoontheir relatives and their ethnic heritage, e.g.protectingtheidentityofsiblingsorotherrelativesinthecontextofforensicinvestigations.Ideally, patients should be involved in such datasharinginitiatives.Institutionsthatcollectandhostdataandpromotedata sharing shouldexplain thebenefitsofcollectingandusingpatientdatatotheirclients. There is a need for public education ongenetics, to take the frightening aspect away,ultimately incentivizing patients and healthycontrolgroupstosharetheirdataanonymouslyforthegreatergood.EasyPublicAccessOne of the cornerstones of the proposed dataecosystemiseasypublicaccesstodata. Ifadoptedbytheresearchcommunity,researchersseekingtousepublicdatamustnolongercombwebsitesandpublic repositories to find what they need. It isimportantthateveryhumanbeinghasthepotentialto participate in this process, either on the dataprovidersideorbyanalyzingdatasets.Accesstothedata should not be restricted to certainorganizationsorresearcherswithalongpublicationrecord.Thesesystemsareoutdatedandarchaic.Usually, with most databases and biobanks, thebona fide user must apply for access with acorrespondingdataaccesscontrolbody,provideaninstitution IDsuchasOpenIDorORCID,andoftenprovide a long publication record. Usually, thatmeans they must register from, and be affiliatedwith, an approved research institute and need toregister their organizationwhichmeans theywillneed their organization's signing official to

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 23: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

23

participateintheregistrationprocess.Often,wheninterestedresearcherswanttoaccessdatainpublicdatabases, they are asked to submit a lengthyproposaldetailingthedatatheywanttouseandtherationalefortheirrequest.Mostofthetime,projectsmust involvehealth-relatedresearch that is in thepublicinterest.SometimestheresearcherneedstogothroughanInstitutionalReviewBoard(IRB)firstand even then, all proposals received by the datarepositoryareconsideredonanindividualbasisandmayhavetobeevaluatedbyaninternalorexternalethics committee. On average, a straightforwardapplicationandapprovalprocesstakes2–3months;sometimesithastakenupto6months45.Obtainingaccesstodatathatareavailablefrominternationalsources adds another layer of complexity, due tointernationallaw.Most databases require the user then require astandardmaterial transferagreement(MTA)tobesignedpriortoanydatadeliverythatgovernshowtheresearchercanusethedata.Butevenwhendataaccess is granted, it is common that datasets arerestricted toonly thosedata andparticipants thattheresearcherrequiredatthattime(e.g.COVID-19patients only or specific case–control subsets). Inaddition,mostagreementsgrantaccessforalimitedtimeonly,attheendofwhichareportsummarizingresearch progress is required; for continued dataaccess,yearlyrenewalrequestsareoftenobligatory.Toexemplifytheproblem,ourresearchersappliedfor access to COVID-19 data with four data-providers(e.g.datarepositories,COVID-19researchconsortia, and biobanks) in the UK, EU and US tosupport our COVID-19 research efforts. All fourrequestswererejected,eachforadifferentreason.One public data repository denied access to theirdatabecausetheirlegaltermsonlyallowaccessforgroups working in public institutions but notprivate/commercial companies. Another biobankwasconcernedaboutthelackofpublicationsfromour research unit that could demonstrate theorganization’s current activity in health research,keeping in mind that our researchers are highlyrespected in their field and have a vast list ofscientificpublicationsfrompreviousappointments.OtherobjectionswerethattheproposedCOVID-19research was too broad, basically no exploratorystudies were permitted, or that our juniorresearcher had no publication record (5 or morepapers).OneCOVID-19dataconsortiumsuggestedthatourorganizationsharesourgeneticdatawiththembutdidnotallowaccesstotheirdatasetsandonly provided access to secondary analyses thattheir researchers performed on the data. Such

processesareoftenundemocratic,againstthepublicinterest and simply take too long to provide anybenefitunderahealthcrisis.Alargemajorityofbiobanksanddatabasesarenon-profitorganizations,andmostofthemoperateonanot-for-profit basis. Consequently, public fundingputssomeobligationondata-custodians toenableresearchwithanyscientificandsocialvalue.Againstthis background, we argue requiring datacustodians to prioritize access options and avoidunderutilization of data. Thus, biobanks ought tomakeallnecessaryarrangementsthatfacilitatethebestpossibleutilizationofthedatasets.Assuch,interestedresearchersonourdatahubdonot requirea longpublication record,nordo theyneedtoprovidea30-pageproposal,referrals,oramembership to an elite club (e.g. an accredited,approvedresearchinstitute)tosearchandanalyzedata. Usability and accessibility needs to beguaranteed for every interested individual, theacademic researcher from any elite university aswellasthepharmamanager,thePhDstudentfromKenia,thedoctoratalocalclinicinWuhan,astudentinaremoteareaofPakistan,orthenurseinavillagehospitalinBrazil.Inaddition,datashouldnotonlybe accessible for a specific research purpose butshouldenableresearchthatisbroaderinscopeandexploratoryinnature(i.e.hypothesis-generating).OtherUsecasesManydiseaseshideinplainsight–inourgenome–making genomics the most specific and sensitivewaytoidentifydiseaseearlyandguidebothpatientsand healthcare practitioners to the best course oftreatment. Yet despite many advances in dataanalytics andAI, our understanding of the humanpanome (including epigenome, proteome,metabolome, transcriptome, and so forth) isincomplete and the ways in whichwe utilize thisinformationislimitedandimperfect.Clinical Diagnosis. Having a secure data-hub forfastdatadisseminationandanalyticsintheclinicalsetting will improve routine clinical care. Mostcomplex diseases, such as cancer have becomeincreasingly dependent on genetic and genomicinformation. When biopsy samples are collectedfrompatientsandsequenced,thereistheneedtobeable, as quickly as possible, to tell whether anyvariantsdetectedareassociatedwiththediseaseorindicate that a patient is likely to respond to aparticular drug. At the same time, with everysequence shared, the accuracy of the globaldiagnostic accuracy improves, not only for thecurrent patient but for all cancer patients around

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 24: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

24

theglobe.Inaddition,linkingpatients’clinicaldatawith genomic data will help answer researchquestions relevant to patient health that wouldotherwisebedifficulttotackle.Rare Diseases. Access to genetic data and ‘real-worldevidence’(RWE)obtainedfromobservationaldataiscrucialtohelppushresearchandinnovationparticularly on rare diseases. The standardapproach to clinical trials is challenging, if notimpossible for most rare diseases. It is thecharacteristicsofrarediseasesthattheycanonlybeadequatelytackledifthereisenoughdatatomakeevidence-baseddecisions,giventhatthenumberofpatients is so small. To make this happen it isabsolutely necessary to break down global datasilos. To move things forward, many globalinitiatives, such as the EU Regulation on OrphanMedicinal Products (OMPs) were introducedspecifically to address the challengeofdevelopingmedicines that treat patients with rare diseases.However, the initial progress seen since theintroduction of these initiatives has tapered off inrecent years, with a decline in a number of drugapprovalsoverthepastyears. In fact,95%ofrarediseasesconditionsremainwithouttreatment.Oneofthemainhurdlesisthecollectionandaccesstodata,particularlyobtainedoutsidethecontextofrandomizedcontrolledtrialsandgeneratedduringroutineclinicalpractice.Gainingthecriticalmassofdata to close data gaps needs global coordinationand represents a critical step for driving forwardresearchintheareaoforphandrugs.Inaddition, thedescribedplatformcanbeused tocollect valuable datasets to specific themes andtherapeutic areas that are not yet available.Currently, several pilot data initiatives started tocollect research-area specific datasets on theShivomplatform:CannabisData-Hub.Incollaborationwithdirecttoconsumer companies, patient support groups andmedicalprofessionals,weaimatcollectingdataonthe interaction of individual’s genomes andmedicinal compounds found in medical andrecreational cannabis. Cannabis-related researchallows governments and pharma researchers tobetterunderstandefficacyandpotentialsideeffectsof hundreds of compounds like cannabinoids andterpenes and to improve legal frameworks forcannabis products. For example, several genevariations are known to affect the user'sendocannabinoid system, increasing sensitivity toΔ9-Tetrahydrocannabinol(THC),potentiallyposing

risk factors for THC-induced psychosis andschizophrenia46,47. Similarly, certain variations inAKT1canmakepeople’sbrainsfunctiondifferentlywhenconsumingmarijuana48.About10percentofoccasional cannabis users develop a physiologicdependence, cravings, or other addiction-relatedbehaviorsthatcanaffecteverydaylife.Asaresult,many cannabis users often go through variousproductsblindlybeforetheiridealbalancebetweentheirdrug’sefficacyandtheirtolerabilityisreached.Patients and their doctors can learn aboutpharmacogenetic aspects of cannabis use and beinformedontheirsusceptibilitiesandcanthereforeadjust cannabis habits to best fit their genotype.Having a comprehensive anonymized geneticdatabase available combinedwith realworld datawillpavetowaytobetterpainmanagementasit’sbeen demonstrated that cannabis may be apromisingoptiontoreplaceopiatesinthetreatmentof pain, as well as replacing other drugs such asantiepilepticsandantipsychotics49.The Longevity Dataset is an ongoing, concertedefforttoprovidegeneticandepigeneticinformationonindividualswhohaveattainedextremeages.Theremarkablegrowth in thenumberof centenarians(aged≥100)hasgarneredsignificantattentionoverthe past years. Consequently, a number ofcentenarian studies have emerged, ranging inemphasis from demographic to genetic. However,most of these studies are not aimed at breakingdowndatasilos.Assuch,datawillbecollectedfromcentenarians,withasubsetofpeoplewhoattainedan age of 110 years or higher – so-called“supercentenarians” toyield sufficientnumbers towarrant descriptive studies and to find geneticfactorsassociatedwithexceptionallongevity.Havingenoughdatasetswillimprovethesearchformeaningful subnetworks of the overall omicnetwork that play important roles in thesupercentenarians’ longevity50 and will help tobetter understand epigenetic mechanisms ofaging51,52.A data repository for the scientific publishingprocess.Havingdatadepositedinapublicdata-hubis of utmost importance, particularly when dataneedstobepublishedrapidly,asinatimesensitivepandemic. The case is easily demonstrated byretractionoftwoCOVID-19studiesinMayof202053.TheNewEnglandJournalofMedicinepublishedanarticle that found that angiotensin convertingenzyme (ACE) inhibitorsandangiotensin receptorblockerswerenotassociatedwithahigherriskofharm in patients with COVID-1954. The Lancet

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 25: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

25

published an observational study by the sameauthors, indicating that hospital patients withCOVID-19 treated with hydroxychloroquine /chloroquine were at greater risk of dying and ofventriculararrhythmiathanpatientsnotgiventhedrugs55. Both journals retracted these articlesbecausetheauthorssaidtheycouldnolongervouchfortheveracityoftheprimarydatasources,raisingserious questions about data transparency. Theproblem was that both studies used data from ahealthcare analytics company called Surgisphere.Afterseveralconcernswereraisedwithrespecttothe correctness of the data, the study authorsannouncedanindependentthird-partypeerreview.However,thedatacustodianSurgisphererefusedtotransfer the full dataset, claiming it would violateconfidentiality requirements and agreementswithclients,leadingtheauthorstorequesttheretractionof both studies. This problem could have beenavoided by publishing the datasets to a datarepositoryasthedescribedplatforminthisarticle.Journalreviewersandotherresearcherscouldhaveeasily accessed and analyzed the data beforeproblems arose, a method that could easily be amandatorypartofthesubmissionprocess.Theurgencyofsharingemergingresearchanddataduringapandemicisobvious.Rapidpublicationandearlysharingofresults isclearlywarranted,but itmeans that the research communitymust also beable to access and analyze the data in a timelymanner. Publishing research data to an openplatformwith audit features anddata provenanceincreasesthechanceofreproducibilityandtohavereliableassurancesfortheintegrityoftherawdata.Consideringthatwehavethetechnicalcapabilitiesto publish anonymized datasets, journals, inprinciple,shouldtrytohavetheirauthorspublicizerawdata inapublicdatabaseor journalsiteuponthe publication of the paper to increasereproducibility of the published results and toincrease public trust in science56. In this context,sharing raw omics data publicly should be anecessaryconditionforanymasterorPhDthesisaswell as research studies to be considered asscientifically sound, unless the authors haveacceptablereasonsnottodoso(e.g.datacontainsconfidentialpersonalinformation);keepinginmindthat fine-grainedpermissionsettingsalsoallow topublish and analyze industrial & proprietaryresearch.Overall,byutilizinganeasy-accessibledata-hub,itispossibletoincorporatevaluableomicsinformationinto routine preventive care, research andlongitudinalpatientmonitoring.Intheend,wecan

leverage the benefits of omics to expand patientaccesstohigh-qualitycareanddiagnosis.NextStepsIt is the aim of the authors to further evolve theproposeddatasharingandanalyticsplatformtothepoint that it is awidely used and truly actionabledata hub for the global healthcare ecosystem.Together with the research community, we areworkingonaddingavarietyofotherfunctionalitiesto make this precision medicine ecosystem morevaluableandinnovative,particularlyinthefieldsofartificial intelligence, interoperability, pandemicpreparedness,andothernewresearchfields.Long-termthiscanincludeprojectstoaddfunctionalitiesfor clinical trials, AI marketplaces, IoT data, orquantumcomputing.Someoftheplannedaddonsinthe roadmap for future releases include a cohortbrowser, improved search engines, many newpipelinesandAItools,aswellasfullyfederateddatasharing capability.We plan to add a forum to theplatformtohelpresearchersexchangeinformationon datasets and data curation. Often, those whowant to make their data more open face abewilderingarrayofoptionsonwhereandhowtoshare it. Providing a space for sharing data andguidingyoungresearchersintheirtaskbysharingnecessaryexpertise indatacurationandmetadatawillhelptoensurethatthedatatheyplantoshareareusefulforothers.COVID-19CalltoactionThe urgency of tackling COVID-19 has ledgovernmentsinmanycountriestolaunchanumberofshort-noticeandfast-trackedinitiatives(e.g.callsfor research proposals). Without propercoordinationamongstministriesandagencies,theyrun the risk of duplicating efforts or missingopportunities, resulting in slower progress andresearch inefficiencies. The bestway to prevent afurther spread of the virus and potential newpandemicsistoapplythesameprinciplesasweuseto prevent catastrophic forest fires: surveyaggressivelyforsmallerbrushfiresandstompthemout immediately. We need to invest now in anintegrated data surveillance architecture designedto identifynewoutbreaks at thevery early stagesand rapidly invoke highly targeted containmentresponses.Currently,ourhealthcare system isnotsetup for this.First,modernmoleculardiagnostictechnologiesdetectonlythoseinfectiousagentswealready know exist; they come up blank whenpresentedwithanovelagent.Infectionscausedbynew or unexpected pathogens are not identified

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 26: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

26

until therearetoomanyunexplained infectionstoignore,whichtriggershospitalstosendsamplestoa public health lab that has more sophisticatedcapabilities. And even when previously unknowninfectious agents are finally identified, it takes fartoolongtodevelop,validateanddistributetestsforthe new agent. By then, the window of time tocontain an emerging infection will likely be past.Usingtheplatformasdemonstrated inthisarticle,all thoseprocessescanbespedupdramatically. Itwouldenableourhealthcareecosystemtohaltthespreadofemerginginfectiousdiseasesbeforetheyreachthepandemicstage.Inthisnewsurveillancearchitecture proposed here, genomic analysis ofpatientspresentingwith severe clinical symptomswould be performed up front in hospital labspiggybackedonroutinediagnostictesting.Researchers can use the data in our data-hub tocompare new cases with existing datasets. Givennational epidemiological data on infection rates;where symptomatic people seek healthcare, aremarkably small number of surveillance siteswouldberequiredinordertoidentifyanoutbreakof an emerging agent. Using genomic sequencing,thenovelcausativeagentcouldbeidentifiedwithinhours, and the network would instantaneouslyconnect the patients presenting at multiplelocations as part of the same incident, providingsituationalawarenessofthescopeandscaleoftheevent.Usingagloballyaccessible,easy-to-usesolutionthatprovidesa ‘one-stopshop’forthecentralizationofinformationonnewvirus-hostinteractionscanhelpensurethatappropriateconditionsforcollaborativeresearch and sharing of preliminary researchfindings and data are in place to reap their fullbenefits. Beyond short-term policy responses toCOVID-19, incorporating data sharing with theShivom platform into ongoing mission-orientedresearchand innovationpolicies couldhelp tacklefuture pandemics on a national or internationalscale, for example to enhance datasetsinformativeness to reveal host infection dynamicsfor SARS-CoV-2 and therefore to improvemedicaltreatment options for COVID-19 patients. Inaddition, it can help to determine and designoptimaltargetsfortherapeuticsagainstSARS-CoV-2.Theplatformisinalignmentwithrecentguidelinespublished by the Research Data Alliance (RDA)COVID-19 Working Group57. The RDA broughttogethervarious,globalexpertisetodevelopabodyof work that comprises how data from multiple

disciplines inform response to a pandemiccombinedwithguidelinesandrecommendationsondata sharing under the present COVID-19circumstances.TheRDAreportoffersbestpracticesand advice – around data sharing, software, datagovernance,and legalandethicalconsiderations–for four key research areas: clinical data, omicspractices,epidemiologyandsocialsciences. In linewiththeserecommendations,theShivomplatformcanbeusedfor:● Sharingclinicaldatainatimelyandtrustworthymanner to maximize the impact of healthcaremeasures and clinical research during theemergencyresponse

● Encouragingpeople topublish theiromicsdataalongsideapaper

● Underlining that epidemiology data underpinearly response strategies and public healthmeasures

● Providingaplatformtocollect importantsocialand behavioral data (as metadata) in allpandemicstudies

● Evidencingtheimportanceofpublicdatasharingcapabilitiesalongsidedataanalytics

● Offeringageneralecosystemtoexploitrelevantethical frameworks relating to the collection,analysisandsharingofdatainsimilaremergencysituations

Atthispointintime,theresearchcommunityshouldbeclearthatasecondorthirdwaveoftheCOVID-19pandemicwillhitmostcountries.Inaddition,otherepidemicsandglobalpandemicswillfollow;itisjustamatterof time.As such, the global research andmedical community need to be prepared. WhenanalysisofdatasetssharedontheShivomplatformresults in publications, we encourage theresearchers to cite this publication to enhancewidespread adoption of this data sharing andanalyticsmarketplace.We also encourage sharingdata from COVID-19 and other virus outbreaks,includingdatafrompeoplewhowereexposedtothesame pathogen but did not get the disease orpossiblywereinfectedbutdidnotgetsymptoms.

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 27: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

18

ConclusionWe developed a next-generation data sharing- &dataanalyticsecosystemthatallowsuserstoeasilyandrapidlyshareandanalyzeomicsdatainasafeenvironment and demonstrate its usefulness as apandemic preparedness platform for the globalresearch community. Adopting the platform as adata hub has the clear benefit that it enablesindividualresearchersandhealthcareorganizationstobuilddatacohorts,makinglarge,orexpensive-to-collect, datasets available to all. In this way, datasharingopensnewavenuesofresearchandinsightsinto complex disease and infectious diseaseoutbreaks. This is not just true of large-scale datasharing initiatives, even relatively small datasets,forexamplerarediseasepatientdata,ifshared,cancontributetobigdataandfuelscientificdiscoveriesin unexpected ways. Particularly during earlydisease outbreaks, the patient-level meta-analysisof similar, past outbreaks may reveal numerousnovel findings that go well beyond the originalpurpose of the projects that generated the data.Using the platform as demonstrated will enhanceopportunities for co-operation and exploitation ofresults. Our data sharing approach is inclusive,

decentralized,andtransparentandprovideseaseofaccess. If adopted, we expect the platform tosubstantiallycontributetotheunderstandingofthevariabilityofpathogensusceptibility,severity,andoutcomesinthepopulationandhelptoprepareforfuture outbreaks. In providing clinicians andresearchers with a platform to share COVID-19relateddata,theclinicalresearchcommunityhasanimproved way to share data and knowledge,removingamajorhurdleintherapidresponsetoanewdiseaseoutbreak.

CompetingInterestsAS,IF,DB,andGLareemployeesofShivomVenturesLimited.

AuthorContributionsConceived and designed the platform: AS, IF.Analyzeddata:IF,GL,DB.Wrotethepaper:AS.

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 28: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

28

References 1. Zhou, P. et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579, 270–

273 (2020). 2. Li, Q. et al. Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus–Infected Pneumonia. N. Engl. J.

Med. 382, 1199–1207 (2020). 3. Boni, M. F. et al. Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19

pandemic. doi:10.1038/s41564-020-0771-4. 4. Shivom. https://www.shivom.io/. 5. Lee, P. H. Sample sizes in COVID-19–related research. CMAJ vol. 192 E461 (2020). 6. European Commission COVID-19 Data Portal. https://www.europeandataportal.eu/en/highlights/covid-19. 7. Canadian COVID Genomics Network. https://www.genomecanada.ca/en/news/genomics-mission-meeting-covid-19-

challenge. 8. Genomics England and the GenOMICC (Genetics of Mortality in Critical Care) consortium. https://genomicc.org/. 9. The COVID-19 Host Genetics Initiative, a global initiative to elucidate the role of host genetic factors in

susceptibility and severity of the SARS-CoV-2 virus pandemic. Eur. J. Hum. Genet. 28, 715–718 (2020). 10. National COVID Cohort Collaborative (N3C). https://covid.cd2h.org/. 11. COVID-19 Genomics UK (COG-UK) Consortium. https://www.sanger.ac.uk/collaboration/covid-19-genomics-uk-

cog-uk-consortium/. 12. Secure Collective COVID-19 Research (SCOR) consortium. https://securecovidresearch.org/. 13. International COVID-19 Data Research Alliance. https://www.hdruk.ac.uk/covid-19/international-covid-19-data-

alliance/. 14. Covid Human Genetic Effort consortium. https://www.covidhge.com/. 15. Meredith, L. W. et al. Rapid implementation of SARS-CoV-2 sequencing to investigate cases of health-care

associated COVID-19: a prospective genomic surveillance study. Lancet Infect. Dis. 0, (2020). 16. Gøtzsche, P. C. Why we need easy access to all data from all clinical trials and how to accomplish it. Trials vol. 12

249 (2011). 17. The Variant Call Format (VCF) Version 4.2 Specification. (2020). 18. Faria, P. L. De & Cordeiro, J. V. Health data privacy and confidentiality rights: Crisis or redemption? Rev. Port.

Saude Publica 32, 123–133 (2014). 19. Krishna, R., Kelleher, K. & Stahlberg, E. Patient confidentiality in the research use of clinical medical databases.

American Journal of Public Health vol. 97 654–658 (2007). 20. Shabani, M. & Marelli, L. Re-identifiability of genomic data and the GDPR. EMBO Rep. 20, (2019). 21. Schumacher, A. REINVENTING HEALTHCARE ON THE BLOCKCHAIN - Toward a New Era in Precision

Medicine. Blockchain Res. Inst. (BRI), Toronto. Big Whitepapers. (2018). 22. Schumacher, A. Blockchain & Healthcare Strategy Guide 2017: Reinventing healthcare - Towards a global,

blockchain-based precision medicine ecosystem. (2017). doi:10.13140/RG.2.2.12162.48327. 23. Pattengale, N. D. & Hudson, C. M. Decentralized genomics audit logging via permissioned blockchain ledgering.

BMC Med. Genomics 13, 102 (2020). 24. Ozercan, H. I., Ileri, A. M., Ayday, E. & Alkan, C. Realizing the potential of blockchain technologies in genomics.

1–9 (2018) doi:10.1101/gr.207464.116. 25. Choudhury, O. et al. Enforcing Human Subject Regulations using Blockchain and Smart Contracts. Blockchain

Healthc. Today 1, (2018). 26. Dementia Platform UK (DPUK). 27. Charlson Comorbidity Index (CCI). 28. Jones, A. E., Trzeciak, S. & Kline, J. A. The Sequential Organ Failure Assessment score for predicting outcome in

patients with severe sepsis and evidence of hypoperfusion at the time of emergency department presentation. Crit. Care Med. 37, 1649–1654 (2009).

29. Celik, O., Koltka, N., Devrim, S., Sen, B. & Gura Celik, M. Clinical pulmonary infection score calculator in the early diagnosis and treatment of ventilator-associated pneumonia in the ICU. Crit. Care 18, P304 (2014).

30. Churpek, M. M. et al. Quick sepsis-related organ failure assessment, systemic inflammatory response syndrome, and early warning scores for detecting clinical deterioration in infected patients outside theintensive care unit. Am. J. Respir. Crit. Care Med. 195, 906–911 (2017).

31. Nextflow. https://www.nextflow.io/. 32. Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat Biotech 35, 316–319 (2017). 33. Zhou, W. et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic

association studies. Nat. Genet. 50, 1335–1341 (2018). 34. Marees, A. T. et al. A tutorial on conducting genome-wide association studies: Quality control and statistical

analysis. Int. J. Methods Psychiatr. Res. 27, (2018). 35. Milne, R. Societal considerations in host genome testing for COVID-19. Genetics in Medicine 1–3 (2020)

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint

Page 29: A global omics data sharing and analytics marketplace: Case … · 2020. 9. 28. · intelligence. Data sharing during a pandemic Global pandemics come and go. The tragic impacts of

29

doi:10.1038/s41436-020-0861-y. 36. Nurk, S. et al. Assembling genomes and mini-metagenomes from highly chimeric reads. in Lecture Notes in

Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) vol. 7821 LNBI 158–170 (Springer-Verlag, 2013).

37. Davis, M. P. a, van Dongen, S., Abreu-Goodger, C., Bartonicek, N. & Enright, A. J. Kraken: A set of tools for quality control and analysis of high-throughput sequence data. Methods 63, 41–49 (2013).

38. Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 257 (2019). 39. Wood, D. E. & Salzberg, S. L. Kraken: Ultrafast metagenomic sequence classification using exact alignments.

Genome Biol. 15, R46 (2014). 40. Sirugo, G., Williams, S. M. & Tishkoff, S. A. The Missing Diversity in Human Genetic Studies. Cell vol. 177 26–31

(2019). 41. Butters, O. W., Wilson, R. C. & Burton, P. R. Recognizing, reporting and reducing the data curation debt of cohort

studies. Int. J. Epidemiol. 2020, 1–8 (2020). 42. Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016). 43. Data sharing and the future of science. Nature Communications vol. 9 1–2 (2018). 44. Milham, M. P. et al. Assessment of the impact of shared brain imaging data on the scientific literature. Nat.

Commun. 9, 1–7 (2018). 45. Learned, K. et al. Barriers to accessing public cancer genomic data. Scientific Data vol. 6 1–7 (2019). 46. Benyamina, A. et al. Association between ABCB1 C3435T polymorphism and increased risk of cannabis

dependence. Prog. Neuro-Psychopharmacology Biol. Psychiatry 33, 1270–1274 (2009). 47. Ortiz-Medina, M. B. et al. Cannabis consumption and psychosis or schizophrenia development. International

Journal of Social Psychiatry vol. 64 690–704 (2018). 48. Bhattacharyya, S. et al. Protein kinase B (AKT1) genotype mediates sensitivity to cannabis-induced impairments in

psychomotor control. Psychol. Med. 44, 3315–3328 (2014). 49. Aviram, J., Aviram, J. & Samuelly-Leichtag, G. Systematic Review Efficacy of Cannabis-Based Medicines for Pain

Management: A Systematic Review and Meta-Analysis of Randomized Controlled Trials. 50. Goertzel, B. & Duncan, M. Unraveling the Complex Genetics of Living Past 105.

https://blog.singularitynet.io/unraveling-the-complex-genetics-of-living-past-105-dc8f5f9ddeb0. 51. Schumacher, A. Aging Epigenetics. in Handbook of Epigenetics 405–422 (Elsevier, 2011). doi:10.1016/B978-0-12-

375709-8.00025-3. 52. Schumacher, A., Bihaqi, S. & Zawia, N. Epigenetics and Late-Onset Alzheimer’s Disease. Epigenetic Asp. Chronic

… 1–20 (2011). 53. Wise, J. Data transparency: ‘Nothing has changed since Tamiflu’. BMJ 369, m2279 (2020). 54. Mehra, M. R., Desai, S. S., Kuy, S., Henry, T. D. & Patel, A. N. Cardiovascular Disease, Drug Therapy, and

Mortality in Covid-19. N. Engl. J. Med. 382, e102 (2020). 55. Mehra, M. R., Desai, S. S., Ruschitzka, F. & Patel, A. N. RETRACTED:Hydroxychloroquine or chloroquine with or

without a macrolide for treatment of COVID-19: a multinational registry analysis. Lancet 0, (2020). 56. Miyakawa, T. No raw data, no science: Another possible source of the reproducibility crisis. Molecular Brain vol. 13

24 (2020). 57. RDA COVID-19 Working Group. RDA COVID-19 Recommendations and Guidelines RDA Recommendation (5th

Release). 124 (2020) doi:10.15497/RDA00046.

. CC-BY-NC 4.0 International licenseIt is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in(which was not certified by peer review)preprint The copyright holder for thisthis version posted September 29, 2020. ; https://doi.org/10.1101/2020.09.28.20203257doi: medRxiv preprint