![Page 1: Bioinformatics for the 100,000 Genomes Projectbioinformaticsbarcelona.eu/media/upload/arxius/GenomicsEngland_112016/... · Outline • Introduction to the UK’s 100,000 genomes project](https://reader033.vdocuments.mx/reader033/viewer/2022050716/5e16a82ba26570776555a193/html5/thumbnails/1.jpg)
Bioinformatics for the 100,000 Genomes Project
Augusto Rendó[email protected] | Genomics England Principal Research Associate | University of Cambridge
Barcelona, 2016-11-02
Outline
• Introduction to the UK’s 100,000 genomes project
• Analyses in rare diseases
• Analyses in cancer
• Bioinformatics Platform
• Data models and flows
• Databases
• Interpretation
Inception of the 100,000 genomes project (2012, 2014)
“If we get this right, we could transform how we diagnose and treat our most complex diseases not only here but across the world” (December 2012)
“I am determined to do all I can to support the health and scientific sector to unlock the power of DNA, turning an important scientific breakthrough into something that will help deliver better tests, better drugs and above all better care for patients.” (August 2014)
• Sequence 100,000 genomes
• Cancer and rare genetic disease
• Capture data delivered electronically, store it securely and analyse it within an English data centre (reading library)
• Combine genomes with extracted clinical information for analysis, interpretation, and aggregation
• Create capacity, capability and legacy in personalised medicine for the UK
Goals of the Genomics England project
1. To bring benefit to NHS patients
2. To enable new scientific discovery and medical insights
3. To kickstart the development of a UK genomics industry
4. To create an ethical and transparent programme based on consent
Genomics England project
http://www.genomicsengland.co.uk/library-and-resources/
Recruitment and clinical interface via 13 “GMCs”, Scotland and Northern Ireland
• Genomic Medicine Centres
  • Networks of NHS hospitals including genomics labs
  • 13 “Lead organisation” plus 71 “Local Delivery Partners”
  • Contracted by NHS England
  • Cover recruitment, data and return of results
• Scotland
  • Doing own sequencing
• Northern Ireland
  • Similar to a GMC
  • Contracted by NI payer
Feedback to participants
Additional finding genes
Requirements:
• A treatable or preventable condition.
• Reliably detected by next generation sequencing.
• Each gene will have a curated list of high confidence, high penetrance variants.
Other conditions may be added if clinically appropriate and technically feasible.
Participants recruited in RD
• About 400 RD participants currently recruited per week
• 5,000 participants recruited to the RD pilot
[Figure: Family Size]
*Data from Main Programme
Recruitment by tumour type
Tumour type | Recruited | Share
Adult Glioma | 19 | 2%
Bladder | 28 | 3%
Breast | 321 | 29%
Childhood | 1 | 0%
Colorectal | 264 | 24%
Endometrial Carcinoma | 20 | 2%
Lung | 139 | 13%
Malignant Melanoma | 3 | 0%
Ovarian | 91 | 8%
Prostate | 95 | 9%
Renal | 65 | 6%
Sarcoma | 44 | 4%
Testicular Germ Cell Tumours | 1 | 0%
>14,896 genomes sequenced (Nov 1)
[Figure: N bases ×10^9 (Q30, no dup) and % autosomal coverage >= 15x (Q30, no dup); germline data only]
• Median % autosomal coverage >= 15X = 97.4%
• About 1.4 PB of data
Analyses in Rare Diseases
Genetic based test
Checks of reported data vs genetics
• Sex checks
  • Coverage-based (WGS)
  • X chromosome heterozygosity and Y chromosome genotyping rate (array)
  • Predicted minor karyotypes include XO, XXY, XYY
• Relatedness checks
  • Mendelian inconsistency rate (where at least one parent sequenced)
  • Estimated identity by descent sharing for all pairs in cohort and working on a family only workflow - PLINK and PC-Relate
  • Can identify rare phenomena, e.g. large-scale uniparental isodisomy
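A minimal sketch of the two kinds of check listed above, under stated assumptions. The expected depth ratios, the tolerance `tol`, and the function names are illustrative and are not taken from the Genomics England pipeline; genotypes are assumed to be autosomal, diploid pairs of allele indices.

```python
import math
from typing import Iterable, Optional, Tuple

# Expected (chrX, chrY) depth relative to mean autosomal depth, where a
# diploid autosome is 1.0 and each single chromosome copy contributes 0.5.
# These values and the tolerance below are assumptions for illustration.
EXPECTED_RATIOS = {
    "XX": (1.0, 0.0),
    "XY": (0.5, 0.5),
    "XO": (0.5, 0.0),
    "XXY": (1.0, 0.5),
    "XYY": (0.5, 1.0),
}

Genotype = Tuple[int, int]


def infer_sex_karyotype(x_ratio: float, y_ratio: float, tol: float = 0.2) -> Optional[str]:
    """Return the nearest expected karyotype, or None if none is within `tol`."""
    best, best_dist = None, tol
    for karyotype, (ex, ey) in EXPECTED_RATIOS.items():
        dist = math.hypot(x_ratio - ex, y_ratio - ey)
        if dist <= best_dist:
            best, best_dist = karyotype, dist
    return best


def is_mendelian_consistent(child: Genotype,
                            mother: Optional[Genotype] = None,
                            father: Optional[Genotype] = None) -> bool:
    """Check one site; works when only one parent has been sequenced."""
    a, b = child
    if mother is not None and father is not None:
        # One child allele must come from each parent.
        return (a in mother and b in father) or (b in mother and a in father)
    parent = mother if mother is not None else father
    if parent is None:
        return True  # no parent sequenced: the site is uninformative
    return a in parent or b in parent


def mendelian_inconsistency_rate(
        sites: Iterable[Tuple[Genotype, Optional[Genotype], Optional[Genotype]]]) -> float:
    """Fraction of informative sites (at least one parent) that are inconsistent."""
    checked = [is_mendelian_consistent(c, m, f)
               for c, m, f in sites if m is not None or f is not None]
    return checked.count(False) / len(checked) if checked else 0.0
```

A sample whose inferred karyotype disagrees with the reported sex, or a family with an unusually high inconsistency rate, would then be flagged for review.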
Coverage based sex checks
Relatedness checking
Analyses in Cancer
Assessing the quality of sample preparation protocols
Fresh Frozen (FF) vs Formalin Fixed Paraffin Embedded (FFPE)
FF
• Costly and not widely available
• Difficult to capture tumour
• High quality DNA
FFPE
• Routinely used
• Digital pathology for tumour selection
• Low quality and quantity of DNA
AT/CG dropout effect on copy number variant calling
Sample | AT dropout | GC dropout | Note
FF sample | 0.00 | 0.06 | low coverage for GC-rich regions
FFPE sample | 0.16 | -0.26 | trend is reversed with poor coverage of AT-rich regions
[Figure: coverage from AT rich to GC rich regions; series: FF AT dropout, FF GC dropout, FFPE AT dropout, FFPE GC dropout]
Fresh frozen and FFPE paired samples: ability to call CNVs
FF AT drop | FF purity | FF RMSD cov | FFPE AT drop | FFPE purity | FFPE RMSD cov
4.7 | 0.6 | 13.1 | 5.8 | 0.6 | 18.9
4.0 | 0.4 | 13.2 | 5.4 | 0.4 | 24.5
4.3 | 0.5 | 14.3 | 6.6 | 0.5 | 22.4
4.4 | 0.4 | 12.9 | 15.8 | NA | 50.7
3.1 | 0.4 | 14.8 | 5.4 | 0.4 | 23.1
![Page 20: Bioinformatics for the 100,000 Genomes Projectbioinformaticsbarcelona.eu/media/upload/arxius/GenomicsEngland_112016/... · Outline • Introduction to the UK’s 100,000 genomes project](https://reader033.vdocuments.mx/reader033/viewer/2022050716/5e16a82ba26570776555a193/html5/thumbnails/20.jpg)
FFPE also affects small variant calling
Overlapping SNVs in FF and FFPE samples from paired
VAF < 5% filtered out
[Figure: proportion of variants]
Comparing sequence quality metrics across labs
[Figure: GMC 1 vs other GMCs]
After standardising on optimised FFPE protocol
Bioinformatics platform
GEL bioinformatics platform
Design Goals
• Scalability: able to operate on several hundred whole genomes per day
• Traceability: able to keep the provenance of every artefact produced in the process
• Knowledge accumulation: able to capture and aggregate the knowledge and decisions made during interpretation in order to generate better knowledge bases
• Service oriented: components talk to each other via well defined APIs and data formats
[Architecture diagram. Parties: Hospitals, Genomics England, Interp. providers. Services: Clinical Data intake service, Genome intake service, Interpretation platform services, Workflow management. Stores: Metadata, Variants, Reference Knowledge, GxP associations, Interpretation, Tracking]
Same data model, many manifestations
How to ensure that all the data is coherently stored and easily retrievable?
• Inspired by Model-Driven Architecture approaches
• Models (schemas) controlled in GitHub, including boilerplate functions to validate data against model
• Documentation auto-generated out of the model
• Services communicate using JSON derived from the model
• Data written against the schema auto-generated from the model in the metadata store using document stores
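To illustrate the idea of validating service payloads against a single model, here is a small sketch using only the Python standard library. The `SampleMetadata` model, its field names and the `validate_json` helper are hypothetical; the real models are the Avro IDL schemas in GelReportModels, which this does not reproduce.

```python
import json
from dataclasses import dataclass
from typing import List, get_type_hints


@dataclass
class SampleMetadata:
    """Hypothetical model; not taken from GelReportModels."""
    sample_id: str
    participant_id: str
    pct_autosomal_cov_15x: float


def validate_json(model: type, payload: str) -> List[str]:
    """Return a list of problems found when checking a JSON object against the model."""
    data = json.loads(payload)
    expected = get_type_hints(model)
    errors = [f"missing field: {name}" for name in expected if name not in data]
    errors += [f"unexpected field: {name}" for name in data if name not in expected]
    for name, ftype in expected.items():
        if name not in data:
            continue
        value = data[name]
        # JSON has a single number type, so accept ints where a float is modelled.
        accepted = (int, float) if ftype is float else ftype
        # bool is a subclass of int in Python; this sketch has no boolean fields.
        if isinstance(value, bool) or not isinstance(value, accepted):
            errors.append(f"wrong type for {name}: expected {ftype.__name__}")
    return errors
```

The design point is that the model is the only place where field names and types are declared; documentation, JSON exchange and storage schemas can all be derived from it rather than maintained separately.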
Data models in the platform
• Use of Avro for its interface definition language, JSON out of the box, automatic code generation of classes to handle these data
• Models (and auxiliary libraries) available here: https://github.com/genomicsengland/GelReportModels/tree/releases/schemas/IDLs
• Documentation for the master branch here: https://genomicsengland.github.io/GelReportModels/index.html
• Bioinformatics models here: https://github.com/opencb/biodata
• For reads and variants we use protocol buffers compatible with GA4GH standards
Interpreted Genome RD
Bertha: Distributed workflow management system (really an enterprise service bus for genomic data)
[Diagram: Producer publishes to Exchange, Exchange routes to Queue, Consumer consumes from Queue]
Components: Message Broker, Tracking DB, Job Scheduler, Dashboard, Delivery API, Auditor, Orchestrator, Grid Consumer
• Restarts
• Scatter-gather
• Single and group processes
• Multiple concurrent workflows
(work in progress)
https://github.com/genomicsengland/bertha
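The publish, route and consume pattern and the scatter-gather feature named on this slide can be sketched in-process as follows. This is not Bertha's code: the `Exchange` class, the routing keys `"job"` and `"result"`, and the `scatter_gather` helper are hypothetical, and a real deployment would use a message broker rather than in-memory queues.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List


class Exchange:
    """Toy exchange: routes each published message to the queues bound to its key."""

    def __init__(self) -> None:
        self._bindings: Dict[str, List[Deque[dict]]] = defaultdict(list)

    def bind(self, routing_key: str, queue: Deque[dict]) -> None:
        self._bindings[routing_key].append(queue)

    def publish(self, routing_key: str, message: dict) -> None:
        for queue in self._bindings.get(routing_key, []):
            queue.append(message)


def scatter_gather(sample: str, chunks: List[str],
                   worker: Callable[[str], int]) -> Dict[str, int]:
    exchange = Exchange()
    jobs: Deque[dict] = deque()
    results: Deque[dict] = deque()
    exchange.bind("job", jobs)
    exchange.bind("result", results)

    # Scatter: the producer publishes one message per chunk.
    for chunk in chunks:
        exchange.publish("job", {"sample": sample, "chunk": chunk})

    # Consume: a consumer takes each job off the queue and publishes its result.
    while jobs:
        job = jobs.popleft()
        exchange.publish("result", {**job, "value": worker(job["chunk"])})

    # Gather: collect the results once every chunk has reported back.
    gathered = {msg["chunk"]: msg["value"] for msg in results}
    if set(gathered) != set(chunks):
        raise RuntimeError("missing results; a real system would restart the failed jobs")
    return gathered
```

For example, `scatter_gather("sample-1", ["chr1", "chr2"], worker)` would run `worker` once per chromosome and return a dictionary keyed by chromosome.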
Workflow diagramme (bertha_default 1.1.0)
Stages: Data intake, Single Sample QC & Processing, Multi-sample QC, Analysis
Start: Sequence received (Intake API). End: Interpretation Request Dispatched.
Steps labelled in the diagram, in extraction order: Intake QC; Cross Sample Contamination; Single-Sample QC Check Point; Identity by Descent; Mendelian Inconsistency Rate; Sex Check; Somatic VCF re-headering; Tumour Cross Sample Contamination; Cross Species Contamination; Depth of Coverage; Concordance check; Intake QC Check Point; Merge Array Genotypes; Multi-Sample QC Check Point; Consent Check Point; Variant Calling; Variant Normalisation; Tumour Ploidy; Tumour Purity; Tumour Clonality; Mutation Signature; Viral Insertions; Actionable Mutation Coverage; SNV & Indel Refinement; Mutation Burden; Inbreeding Coefficient; Homozygosity Runs; Variant Annotation; Variant Tiering; Interpretation Dispatch; Exomiser; Delivery API; Integrity Check; MD5 Check; Validate BAM Picard; Filtered Bamstats; Unfiltered Bamstats; Q30 Bamstats; VCF QC; Fix Permissions; Plot Filtered Bamstats; Generate Filtered Metrics Bamstats; Plot Unfiltered Bamstats; Generate Q30 Metrics Bamstats; QC Stats Post-processing
Interpretation approach
• Virtual Gene Panels
  • Initially assigned by a clinician
  • Working on automated panel suggestions
• Variant filtering
  • Allele Frequency: variant is rare
  • Segregation: variant segregates with condition in family
  • Panel membership (including mode of inheritance)
  • Different for cancer
• Interpretation
  • Automated pathogenicity scoring
  • Manual review
• Several manual QC points
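The three rare disease filters above (rarity, segregation, panel membership) can be sketched as a single predicate. Everything here is a simplification under stated assumptions: the `Variant` class, the `max_af` threshold and the gene names are hypothetical, segregation is modelled as "carried by every affected relative and by no unaffected relative", and mode of inheritance is ignored.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class Variant:
    gene: str
    allele_frequency: float   # population allele frequency
    carriers: FrozenSet[str]  # family members carrying the variant


def passes_filters(variant: Variant, panel_genes: Set[str],
                   affected: Set[str], unaffected: Set[str],
                   max_af: float = 0.001) -> bool:
    """True if the variant is rare, segregates with the condition and is on the panel."""
    is_rare = variant.allele_frequency <= max_af  # threshold is an assumed value
    segregates = affected <= variant.carriers and not (unaffected & variant.carriers)
    in_panel = variant.gene in panel_genes
    return is_rare and segregates and in_panel
```

A real implementation would apply different segregation rules per mode of inheritance (for example, allowing unaffected heterozygous carriers under a recessive model).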
PanelApp: Crowdsourcing curation of gene disease associations
https://bioinfo.extge.co.uk/crowdsourcing/PanelApp/
Status of Panels
• 190 panels
• 97 >= v1 panels
• 3,512 genes
• 435 registered reviewers
• 15,149 gene level reviews
• Recognised by the UK genetic testing network
• Curation reaching a point of diminishing returns
[Figure: number of reviews (log scale, 1 to 10000) against reviewers (0 to 150)]
Automated panel suggestion
[Diagram: Diseases, core HP terms and panels. Labels: HP1 to HP5; Disease X; Panel X (G1, G2, G3), Panel Y (G1, G4, G5), Panel Z (G6, G7, G8)]
[Diagram: Diseases, HP annotation and genes. Labels: HP2, HP3, HP4; G6, G7; X (G1, G2, G3), Y (G1, G4, G5), Z (G6, G7, G8)]
Also get a QC score for how phenotypically similar patient is to recruited disease
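One simple way to turn HPO terms into panel suggestions is to score the overlap between a patient's terms and each disease's core terms. The slide does not say which similarity measure is used, so the Jaccard index, the `min_score` threshold and the function names below are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_panels(patient_terms: Set[str],
                   disease_terms: Dict[str, Set[str]],
                   disease_panels: Dict[str, str],
                   min_score: float = 0.5) -> List[Tuple[str, str, float]]:
    """Return (disease, panel, score) for diseases similar enough to the patient."""
    scored = [(disease, disease_panels[disease], jaccard(patient_terms, terms))
              for disease, terms in disease_terms.items()]
    hits = [row for row in scored if row[2] >= min_score]
    # Best match first; ties broken by disease name for a stable order.
    return sorted(hits, key=lambda row: (-row[2], row[0]))
```

The same score, computed against the disease the patient was recruited under, could serve as the QC value mentioned above: a low score suggests the recorded phenotype does not look like the recruited disease.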
RD pilot benchmarking
• 1831 participants with HPO terms, assigned panels (2674 total) and core disease
• 847/1831 (46%) have exactly same panels
• 728/1831 (40%) have same panels plus 1 or 2 extra
• 256/1831 (14%) are missing some of medical review panels
[Figure: histogram of gene gains or losses; bins from -900 to 900 and More, frequency from 0 to 1200]
7 November 2016
Filtering in the rare diseases programme
Filtering in the cancer programme
Domain 1
Variants in a virtual panel of actionable genes (between 20 and 40). Actionable genes are defined as genes with short variants associated with therapeutic, prognostic or diagnostic actions by GenomOncology (My Cancer Genome)
Matching at the variant level
Domain 2
Variants in the genes from Cancer Gene Census - 534 genes.
Domain 3
Variants in all other genes
Frequency filters: exclude common variants (1000G, ExAC, GEL)
Consequence filters: exclude synonymous variants
Two part reports: Actionable and “Interesting”
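The domain logic can be read as a short decision cascade. The sketch below is one reading of the slide rather than the production rules: the function name, the `max_af` threshold, the variant identifier format and the assumption that the frequency and consequence filters are applied before any domain is assigned are all illustrative.

```python
from typing import Optional, Set


def assign_domain(gene: str, variant_id: str, allele_frequency: float,
                  consequence: str, actionable_variants: Set[str],
                  census_genes: Set[str], max_af: float = 0.01) -> Optional[int]:
    """Return 1, 2 or 3, or None when the variant is filtered out."""
    # Frequency and consequence filters (threshold is an assumed value).
    if allele_frequency > max_af or consequence == "synonymous_variant":
        return None
    # Domain 1 matches at the variant level, not just the gene level.
    if variant_id in actionable_variants:
        return 1
    # Domain 2: any remaining variant in a Cancer Gene Census gene.
    if gene in census_genes:
        return 2
    # Domain 3: variants in all other genes.
    return 3
```

Under this reading, Domain 1 would feed the "Actionable" part of the report and the remaining domains the "Interesting" part.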
Supplementary analysis
[Figure panels: Structural variants; Mutational density; Coverage and copy number; Mutational signatures; Hypermutation rain plots; Mutation context]
Bioinformatics platform components
Cellbase
• Reference data store / Annotation Engine
OpenCGA
• Catalog: metadata and clinical data store
• Storage: variant database
Interpretation Platform
• Interpretation service: manage various producers and consumers
• Interpretation warehouse (under construction): stores and serves interpretation data
https://github.com/opencb/opencga
https://github.com/opencb/cellbase
OpenCB family of applications
[Stack diagram. Applications: Genome Browser, Variant Analysis, Data Discovery. Interface Layer. Services: OpenCGA Catalog, OpenCGA Storage, Cellbase. Backends: MongoDB, MongoDB, MongoDB, HBASE, Posix FS]
Cellbase
• Knowledge base management
• Uses Ensembl, Uniprot, IntAct, ClinVar, etc.
• Current database engine is MongoDB
• JSON outputs against well defined model
• Supports annotation against local DBs
• Annotates about 10,000 variants/second per instance
• Python and R APIs
http://nar.oxfordjournals.org/content/40/W1/W609.short
Annotation against Cellbase
http://bioinfo.hpc.cam.ac.uk/cellbase/webservices/
https://github.com/opencb/cellbase
CellBase 4.0 - VEP 82 Consequence type benchmark (1kG phase 3, 83M variants)
● VEP annotations: 346M
● CellBase annotations: 346M
● Coincidence at SO term level (346M annotations)
  – Annotations provided by VEP and not provided by CellBase: 3364 (99.999% coincidence)
  – Annotations provided by CellBase and not provided by VEP: 4918 (99.999% coincidence)
    ● 60% Due to differences on miRNA data sources
    ● 39% Difficulties with VEP output format parsing
● Coincidence at variant level (83M variants)
  – Variants with conflicting annotation: 4990 (99.994% coincidence)
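The coincidence percentages follow directly from the counts on the slide, taking coincidence as one minus the discordant fraction. The check below assumes the rounded totals shown (346M annotations, 83M variants) are close enough to the exact counts; the function name is illustrative.

```python
def coincidence_pct(discordant: int, total: float) -> float:
    """Percentage of items on which the two annotators agree."""
    return 100.0 * (1.0 - discordant / total)


# SO term level, out of roughly 346 million annotations.
vep_only = round(coincidence_pct(3364, 346e6), 3)       # 99.999
cellbase_only = round(coincidence_pct(4918, 346e6), 3)  # 99.999

# Variant level, out of roughly 83 million variants.
conflicting = round(coincidence_pct(4990, 83e6), 3)     # 99.994
```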
Annotation for phased MNVs and CNVs
• Support for CNVs new in CellBase 4.5 Beta
  • Main challenge: support imprecise calling - match against already reported CNVs (population frequencies, clinical variants)
  • Same annotation data as for the rest of variants: consequence type, population frequencies, etc.
  • Example CNV
• Support for MNVs and phased variants from CellBase 4.0
  • Consequence type depends on variants affecting the same codon
  • Variants are assigned a phase set (phased VCFs include the PS tag) - all variants on the same phase set shall be processed together
Annotation of MNVs
• Example MNV: 17:270550:AACAG:TGCAA
• Decompose into single phased variants, members of the same phase set:
{"id":"17:270550:A:T","result":[{"codon":"cAA/cTG","proteinVariantAnnotation":{"reference":"GLN","alternate":"LEU"},"sequenceOntologyTerms":[{"accession":"SO:0001583","name":"missense_variant"}
{"id":"17:270551:A:G","result":[{"codon":"cAA/cTG","proteinVariantAnnotation":{"reference":"GLN","alternate":"LEU"},"sequenceOntologyTerms":[{"accession":"SO:0001583","name":"missense_variant"}
{"id":"17:270554:G:A","result":[{"codon":"caG/caA","proteinVariantAnnotation":{"reference":"GLN","alternate":"GLN"},"sequenceOntologyTerms":[{"accession":"SO:0001819","name":"synonymous_variant"}
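The decomposition step for the example above can be sketched as follows. This is not CellBase code: the function names are hypothetical, and the codon grouping assumes a forward-strand coding sequence whose first codon base sits at position 270549, a value inferred from the `cAA/cTG` codon in the output shown rather than taken from an annotation source.

```python
from collections import defaultdict
from typing import Dict, List


def decompose_mnv(chrom: str, pos: int, ref: str, alt: str) -> List[str]:
    """Split an equal-length MNV into SNV ids, skipping unchanged bases."""
    if len(ref) != len(alt):
        raise ValueError("this sketch only handles equal-length substitutions")
    return [f"{chrom}:{pos + i}:{r}:{a}"
            for i, (r, a) in enumerate(zip(ref, alt)) if r != a]


def group_by_codon(variant_ids: List[str], cds_start: int) -> Dict[int, List[str]]:
    """Group SNVs of one phase set by codon index so each codon is annotated once."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for variant_id in variant_ids:
        position = int(variant_id.split(":")[1])
        groups[(position - cds_start) // 3].append(variant_id)
    return dict(groups)


snvs = decompose_mnv("17", 270550, "AACAG", "TGCAA")
codons = group_by_codon(snvs, cds_start=270549)  # cds_start is an assumed value
```

Grouping matters because the first two SNVs fall in the same codon: annotated together they give CAA to CTG (Gln to Leu), whereas annotating each one on its own would report a different amino acid change.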
OpenCGA - Catalog
Metadata store and A&A for OpenCGA
• Manages roles, groups, ACLs
• Audit log
• LDAP integration
• Arbitrary schemas (annotation sets)
OpenCGA - Storage
Extensive capabilities to query across genotype and phenotype relationships
6 node Hadoop cluster:
• Transform: 97 min
• Load: 80 sec
• Merge: 84 sec
• Millisecond response times for regional queries
• Whole genome filtering queries for all individuals within seconds
Aspiration to be fully GA4GH compatible from v1.0
Platform for interpretation (under construction)
Key (personal) learnings
• There is great strength in multidisciplinary teams with specialisation, but those individuals who can span both biology/genetics and software engineering are pivotal: they connect the specialists
• Good software engineering practices also apply to bioinformatics, to name a few: designing, documenting, testing, support and service. Skipping them doesn't really save you time
• I have become a big fan of using well established technologies with rich ecosystems (e.g. Hadoop) rather than inventing new formats, data structures, tool chains
Final thoughts
• The future in human genetics will be underpinned by academic/industrial partnerships; both the task and the benefits are too big to go at it alone
• Genomic Medicine is just one of the pilots of a digital revolution in healthcare where artificial intelligence will complement/replace the diagnostic journey
• But genomics is the easy part, clinical data is the real challenge