DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2016

Visualiseringsverktyg för data från helgenomsekvensering

ALEXANDER KVIST
RASMUS LARSSON

KTH SCHOOL OF TECHNOLOGY AND HEALTH



This degree project was carried out in collaboration with the Center for Molecular Medicine (CMM).
Supervisor at CMM: Jesper Eisfeldt

Visualiseringsverktyg för data från helgenomsekvensering

Visualization tools for data from whole genome sequencing

Alexander Kvist
Rasmus Larsson

Degree project in medical engineering
First cycle, 15 credits

Supervisor at KTH: Mattias Mårtensson
Examiner: Lars Gösta Hellström

School of Technology and Health

KTH Royal Institute of Technology
KTH STH

SE-141 86 Flemingsberg, Sweden
http://www.kth.se/sth

2016


Summary

Whole genome sequencing generates enormous amounts of complex data that can be difficult to analyze. Visualizing this data is an important step in facilitating analysis. Of particular interest is the visualization of structural variants, variations in DNA larger than 1000 base pairs, which are believed to underlie several genetic diseases. For this purpose four tools were developed: a circle diagram, a coverage diagram, a karyotype diagram and an interaction heatmap. The software was written in Python and uses the Qt framework and its associated Python bindings for the graphical user interface, together with the Matplotlib library for plotting certain graphs. The tools contain a range of functions and tie them together in an interface that is simple for the user, but there is room for further development. A number of suggestions for such development are discussed, such as implementing more functions, integrating the tools better with existing software, and improving portability through network functions.


Abstract

Whole genome sequencing generates enormous amounts of complex data that can be difficult to analyze. Visualization of this data is an important step to facilitate analysis. Of particular interest is visualization of structural variants, variations in DNA greater than 1000 base pairs, some of which are thought to be the cause of genetic disorders. For this purpose four tools were developed: a circle diagram, a coverage diagram, a karyotype diagram and an interaction heatmap. The software was written in Python and utilizes the framework Qt and associated Python bindings for its graphical user interface, together with the library Matplotlib for some plotting functions. Although the tools feature a variety of functions and tie these together in an easy-to-use interface, there is still room for development. A number of suggestions for such development are discussed, such as implementing more functions, integrating the tools better with existing software, and improving portability through network functions.


Contents

1 Introduction
  1.1 Purpose and goals
  1.2 Delimitations

2 Background
  2.1 The human genome
  2.2 Structural variants
  2.3 Sequencing data
    2.3.1 VCF file
    2.3.2 TAB file
  2.4 Selection of existing visualization software
  2.5 Diagram types
    2.5.1 Circle diagram
    2.5.2 Coverage diagram
    2.5.3 Karyotype diagram
    2.5.4 Interaction heatmap

3 Method
  3.1 Specifications
  3.2 Language, libraries, resources and structure
  3.3 Follow-up and testing

4 Results
  4.1 Common functions
  4.2 Circle diagram
  4.3 Coverage diagram
  4.4 Karyotype diagram
  4.5 Interaction heatmap

5 Discussion
  5.1 Further development
    5.1.1 Circle diagram
    5.1.2 Coverage diagram
    5.1.3 Karyotype diagram
    5.1.4 Interaction heatmap
    5.1.5 General

6 Conclusions

7 References

Appendix A Figures

Appendix B Source code


1 Introduction

In 2004 the Human Genome Project published the human genome. This was an enormous step forward in identifying and mapping the genes of the human genome, made possible by methods for DNA sequencing, that is, determining the sequence of base pairs that make up DNA (Schmutz et al., 2004). Every individual has a unique genome, and much valuable information can be obtained by studying the gene variants that make up an individual's DNA. Part of the variation in the human genome consists of so-called structural variants. These are changes in DNA larger than 1000 base pairs (Feuk, Carson, and Scherer, 2006, p. 86), and they are particularly interesting to study because they play a role in several genetic diseases. For example, in a study of two patients with dyslexia, the Center for Molecular Medicine (CMM) at Karolinska Institutet identified two translocations, a type of structural variation, in both patients. Using whole genome sequencing, the positions of these were then pinpointed to a region in the gene CTNND2. Through further studies, the effect of the translocations on the gene could be linked to dyslexia in the patients (Hofmeister et al., 2014).

Whole genome sequencing (WGS) is a technique that enables precise identification of structural variation in the genome, but the sequencing generates an enormous amount of data that is often complex. Analyzing this data requires expertise in several fields and is made harder still by the fact that thousands of variants are of no clinical interest. Specialized software is available that analyzes patterns in WGS data. This software finds structural variants and writes them to a so-called VCF file (variant call format). However, the client, CMM, currently lacks satisfactory software for simplifying the analysis by visualizing VCF data.

1.1 Purpose and goals

The project aims to help practitioners in molecular genetics analyze structural variants in a genome, making it easier to identify variants that are believed to underlie various genetic diseases. The goal of the project is therefore to develop a set of tools that use VCF files and other data from the sequencing to present a range of structural variants, such as deletions and translocations, in a graphical interface. The visualization tools sought are the following: a circle diagram, a coverage diagram, a karyotype diagram and an interaction heatmap. The software must also be easy for an end user to install and operate, and it must be platform independent.

1.2 Delimitations

The visualization tools are designed around the client's VCF files. Although the VCF format is standardized, VCF files can differ from one another in content and, to some extent, in structure. The software developed in this project is limited to VCF files from the client, and compatibility with VCF files from other sources cannot be guaranteed.


2 Background

2.1 The human genome

A genome is the complete genetic code of an organism. In humans it consists of slightly more than 3 billion DNA base pairs, divided into 23 pairs of chromosomes. The genome consists of both non-coding DNA and sequences, genes, that code for various proteins. Mutations in the genome can range from changes in single base pairs to changes in entire chromosomes. While most mutations are presumed to have no particular effect on an individual, mutations in certain genes can lead to defective proteins, with a risk of giving rise to a genetic disorder (Doniger et al., 2008).

2.2 Structural variants

Humans are thought to share 99.9% of their DNA with one another. By studying the small part that differs, the variation, new discoveries can be made in a number of biomolecular fields. Researchers have long been able to demonstrate differences between the DNA of different people. For a long time, however, these differences were limited to variations in chromosomes that could be seen under a microscope, around 3 million base pairs in size. Modern techniques such as whole genome sequencing have made it possible to detect even smaller differences. A structural variant is defined as a mutation in DNA that is larger than 1000 base pairs. The term structural abnormality is often used when the structural variant is believed to cause disease. The following are examples of structural variants:

• Deletion - a sequence on the chromosome is missing

• Duplication - a sequence is repeated

• Inversion - a sequence has the opposite orientation relative to the rest of the chromosome

• Insertion - a sequence is copied and inserted at another location on the chromosome

• Translocation - a change in the position of a sequence in the genome where the total DNA content has not changed (Feuk, Carson, and Scherer, 2006)

2.3 Sequencing data

2.3.1 VCF file

The most information-rich file format underlying the visualization is the so-called VCF file (variant call format). This is a standardized format that describes various structural variants within a genome. The file is designed so that it can be read by both humans and computers. It consists of a large number of lines of text, where each line represents one variant and describes parameters such as chromosome, position, variant type and a range of other data (Danecek et al., 2011). The number of lines in a file can vary greatly depending on how heavily the file has been filtered (the VCF file provided by the client has about 2000 lines) but is often very large. This, combined with the often complicated formatting of certain parameters for each variant, makes the information hard to handle for an untrained eye.
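To make the line structure concrete, the following is a minimal sketch of reading one VCF data line using only the Python standard library. The eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) follow the VCF specification. The INFO keys `SVTYPE` and `END` are the ones the specification reserves for structural variants, but whether the client's files use exactly these keys is an assumption. The function names and the example line are made up for illustration and are not taken from the project's source code.

```python
def parse_vcf_line(line):
    """Parse one tab-separated VCF data line into a small dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, _ref, alt, _qual, _filter, info = fields[:8]
    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    info_map = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        info_map[key] = value if sep else True
    return {
        "chrom": chrom,
        "start": int(pos),
        "alt": alt,
        # Assumed keys; fall back to ALT and POS if a file lacks them.
        "type": info_map.get("SVTYPE", alt.strip("<>")),
        "end": int(info_map.get("END", pos)),
    }


def read_variants(lines):
    """Yield parsed variants, skipping header lines that start with '#'."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield parse_vcf_line(line)


# Made-up example line, not taken from any real file.
example = "1\t1000000\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=1005000;IMPRECISE"
variant = parse_vcf_line(example)
```

Keeping the INFO column as a dictionary means that additional keys found in a particular file can be looked up later without changing the parser.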


2.3.2 TAB file

The TAB file contains coverage data and comes from the same sequencing run as the VCF file. Coverage is a measure of how many times a read sequence matches the corresponding sequence in a reference genome, a kind of average of the genomes of several individuals. In the TAB file each chromosome is divided into sequences 1000 base pairs long, with one coverage value per sequence. A low value means that a sequence is probably not present in the sequenced genome and may indicate a deletion. A high value may similarly indicate a duplication. The file name comes from how the data is represented in the file: the parameters on each line are separated by tab characters.
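As an illustration, a tab-separated coverage file of this kind could be read as sketched below. The text above does not state the column order, so the layout assumed here (chromosome, start, end, coverage) is a guess, as is the convention that header lines start with `#`. The function name and the sample data are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_coverage(handle):
    """Return {chromosome: [(start, end, coverage), ...]} from a TAB file.

    Assumed column order: chromosome, start, end, coverage.
    """
    bins = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment/header lines
        chrom, start, end, coverage = row[:4]
        bins[chrom].append((int(start), int(end), float(coverage)))
    return dict(bins)


# Made-up data: three 1000 bp bins on chromosome 1, one on chromosome 2.
sample = io.StringIO(
    "#chrom\tstart\tend\tcoverage\n"
    "1\t0\t1000\t30.0\n"
    "1\t1000\t2000\t2.5\n"
    "1\t2000\t3000\t61.0\n"
    "2\t0\t1000\t29.0\n"
)
coverage = read_coverage(sample)
```

In this made-up sample the second bin on chromosome 1 has a much lower value than its neighbours, which is the kind of pattern that may indicate a deletion.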

2.4 Selection of existing visualization software

One of the most widely used tools for visualizing sequencing data is the Integrative Genomics Viewer (IGV), developed by the Broad Institute, a collaboration between MIT and Harvard. The tool offers interactive visualization of data at all scales, from whole genomes down to individual base pairs. Development started in 2007 in order to visualize data from The Cancer Genome Atlas project, something the tools of the time could not do satisfactorily. IGV is written in Java and supports a number of file formats for sequencing data; support for VCF was added in 2011. A data structure built on several levels of resolution, in which each pixel on the screen corresponds to an averaged stretch of the genome, keeps the program's memory usage to a minimum (Thorvaldsdóttir, Robinson, and Mesirov, 2013).

Circle diagrams are effective for showing positional relationships in a genome. Circos is a program written in Perl that is run from a command line to generate circular diagrams from anything from whole genomes down to individual sequences. The program takes data files in the General Feature Format (GFF) and a number of configuration files as input, and outputs the diagrams as images. Different levels of resolution can be used for different regions of the genome (Krzywinski et al., 2009).

The Personal Genome Browser (PGB) was developed in 2014 to provide comprehensive annotation and visualization of the genomes of organisms. PGB uses coverage diagrams, circle diagrams and more to visualize genetic variation. It is accessed over the internet through a web browser such as Google Chrome or Mozilla Firefox, but the program can also be deployed to local servers (Juan et al., 2014).

2.5 Diagram types

2.5.1 Circle diagram

Software that visualizes genome data as circle diagrams is already available (Krzywinski et al., 2009). However, the client wants a way to generate a circle diagram automatically from the VCF files through a graphical user interface, without complicated configuration files. An example of what a circle diagram can look like is shown in figure 1. The diagram is to be divided into circular sectors, each corresponding to one chromosome pair. Relationships between chromosome pairs are shown as lines between them. The relationships are determined by structural variants. Inside the outer circle there is a graph of the coverage.


Figure 1: Example of a circle diagram (Morgan, Walker, and Davies, 2012, p. 342). See also figure 1, appendix A.

Figure 2: Karyotype (Zhao et al., 2005, p. 360). See also figure 2, appendix A.

In addition to the diagram, there should also be information about the lengths of the chromosomes, which variants they contain, and the lengths and names of the variants.

2.5.2 Coverage diagram

The coverage diagram is a visualization of coverage data from the TAB file. The coverage of sequences at different positions in a chromosome is plotted here in a line chart.

2.5.3 Karyotype diagram

The graphical representation of chromosomes, paired and arranged in ascending order, is called a karyotype. The chromosomes are prepared with a standardized staining procedure that reveals structural characteristics of interest. It is the staining that gives rise to the dark bands (cytobands) on the chromosomes, see figure 2. Through careful analysis of karyotypes, geneticists can detect large genetic abnormalities such as trisomy 21 (Down syndrome), but also structural variants such as deletions, duplications and translocations (O'Connor, 2008). The karyotype diagram is based on a karyotype in which relationships between chromosomes can be seen as lines between them, like the lines in the circle diagram. As in the circle diagram, there should also be more detailed information about the chromosomes and their variants.

2.5.4 Interaction heatmap

This diagram is intended to highlight how chromosomes interact with one another. In the circle diagram and the karyotype diagram this is achieved with lines. Here it is done with a heatmap instead. The number of interactions is color-coded on a scale where, for example, few interactions are shown in blue and many interactions in red.


3 Method

3.1 Specifications

The work began with a meeting with the client to discuss specifications and limitations of the software, as well as suitable languages and tools for developing it. The following guidelines were set as a starting point for development:

• Installing the software should be simple.

• The software should be platform independent.

• Priority should be given to making the software and its functions easy for the end user to operate, rather than to a polished appearance.

At the meeting a time frame was set for each step of development, and it was decided to focus on the circle diagram, the coverage diagram and the interaction heatmap.

3.2 Language, libraries, resources and structure

Python version 3 was chosen as the development language, with the cross-platform framework Qt (version 4.8) for graphics and window handling, together with the Python library PySide, which provides bindings between Python and Qt. The matplotlib library and its Qt back end were also used to produce certain plots. The program PyInstaller was used to create executable files of the software. The client provided VCF and TAB files created with DNA from real patients for use during development. Because this is patient data, the files were kept confidential throughout the work. The internal structure of the program is based on the Model-View-Controller architectural pattern, in which data, view and control are separated into different modules. To gain insight into how genetic data can be handled and visualized, existing software was studied. The client recommended a number of relevant articles about such software and about current research in genetics.
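The division of responsibilities in Model-View-Controller can be illustrated with a small sketch. It deliberately uses plain Python instead of Qt so that it runs without a graphical environment. All class and method names are hypothetical and are not taken from the project's source code, which is found in appendix B.

```python
class VariantModel:
    """Model: holds the data and notifies listeners when it changes."""

    def __init__(self):
        self.variants = []
        self._listeners = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def add_variant(self, variant):
        self.variants.append(variant)
        for callback in self._listeners:
            callback(self)


class TextView:
    """View: renders the model, here as plain text instead of Qt graphics."""

    def __init__(self):
        self.rendered = ""

    def refresh(self, model):
        self.rendered = "\n".join(
            "{chrom}:{start}-{end} {type}".format(**v) for v in model.variants
        )


class Controller:
    """Controller: translates user actions into model updates."""

    def __init__(self, model, view):
        self.model = model
        model.subscribe(view.refresh)

    def on_add_variant(self, chrom, start, end, sv_type):
        self.model.add_variant(
            {"chrom": chrom, "start": start, "end": end, "type": sv_type}
        )


model, view = VariantModel(), TextView()
controller = Controller(model, view)
controller.on_add_variant("1", 1000000, 1005000, "DEL")
```

Because the view only reacts to notifications from the model, the same data could in principle feed several diagram types without the model knowing how any of them is drawn.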

3.3 Follow-up and testing

Follow-up meetings with the client were held at important stages of development to evaluate implemented functions and to set new specifications. To ensure platform independence, the software was tested on Linux, Windows and Mac OS X whenever significant changes or new functions had been implemented.


4 Results

The developed software has been successfully tested on Linux (Debian 9, Stretch), OS X (10.11, El Capitan), Windows 7 and Windows 10. To make installation simple, executable files that start the program were created for Windows and Linux, but not for OS X within the time frame of the project. On OS X the software can still be started from the terminal.

Overall, the project resulted in a program that can read a number of different types of data files and process them into structures that four visualization modules can work with. The modules create interactive diagrams based on the diagram types described in 2.5. A user can adjust settings for how each module visualizes data, and which data it visualizes, according to need. Data is presented as graphics, windows and lists with information about chromosomes, coverage data for regions within them, and the variants that have been loaded. Users can add some graphical elements themselves, such as explanatory text where needed. The program can export images of what is visualized for use in reports, publications and the like.

The main function of the circle diagram is to visualize how structural variants are positioned within a genome, in the form of lines between chromosomes. A line is defined by the start and end positions of a variant. Coverage data is displayed as a smaller circle inside the large one. More detailed information is available in tables integrated into the program. These contain information about the lengths of the chromosomes and the number of variants they have. There is also specific information about each variant: start and end position, its type, and which genes it affects.
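The geometry behind such a diagram can be sketched as follows: each chromosome is given an arc proportional to its length, and a variant position is mapped to a point on that arc, so that a connection becomes a line between two such points. This is one plausible way to lay out the circle, not necessarily how the tool does it. The function names, the gap parameter and the toy lengths are assumptions made for illustration.

```python
import math


def chromosome_sectors(lengths, gap_degrees=2.0):
    """Return {name: (start_angle, end_angle)} in degrees.

    Each chromosome gets an arc proportional to its length, with a fixed
    gap between neighbouring sectors.
    """
    total = sum(lengths.values())
    usable = 360.0 - gap_degrees * len(lengths)
    sectors, angle = {}, 0.0
    for name, length in lengths.items():
        span = usable * length / total
        sectors[name] = (angle, angle + span)
        angle += span + gap_degrees
    return sectors


def position_to_point(sectors, lengths, chrom, position, radius=1.0):
    """Map a base-pair position to an (x, y) point on the circle."""
    start, end = sectors[chrom]
    angle = math.radians(start + (end - start) * position / lengths[chrom])
    return (radius * math.cos(angle), radius * math.sin(angle))


# Made-up lengths chosen for easy arithmetic, not real chromosome sizes.
lengths = {"A": 300, "B": 100}
sectors = chromosome_sectors(lengths, gap_degrees=0.0)
# A connection is a line between the points for a variant's two ends.
line = (
    position_to_point(sectors, lengths, "A", 0),
    position_to_point(sectors, lengths, "B", 100),
)
```

With these toy lengths, chromosome A occupies three quarters of the circle and B the remaining quarter.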

The coverage diagram visualizes coverage data as either a line chart or a scatter plot. This mode is useful for taking a closer look at regions with deviating coverage, which may indicate deletions or duplications.

The karyotype diagram is used in much the same way as the circle diagram, but it gives a better overview of the characteristic bands of the chromosomes and of which variants lie within them, and it shows how they are linked by drawing connections between affected bands.

Another way to visualize the positions of variants, like the circle diagram, is the interaction heatmap. Here the axes of a coordinate system represent positions on chromosomes. If several variants have start and end positions close to one another, a warmer color is drawn in the corresponding area of the heatmap.
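The idea of the heatmap can be sketched as a two-dimensional count: each variant contributes to the cell given by its start and end position, and the count in a cell decides the color. The sketch below is a simplified illustration using only the standard library. The bin size, the shared coordinate axis and the linear blue-to-red scale are assumptions, and the function names are hypothetical.

```python
from collections import Counter


def interaction_counts(variants, bin_size):
    """Count variants per (start bin, end bin) cell.

    `variants` is an iterable of (start, end) positions on a shared
    coordinate axis; `bin_size` is the cell width in base pairs.
    """
    counts = Counter()
    for start, end in variants:
        counts[(start // bin_size, end // bin_size)] += 1
    return counts


def heat_color(count, max_count):
    """Map a count to an (r, g, b) tuple from blue (few) to red (many)."""
    fraction = count / max_count if max_count else 0.0
    return (fraction, 0.0, 1.0 - fraction)


# Made-up positions: two variants fall in the same cell, one elsewhere.
variants = [(1200, 9100), (1800, 9900), (5000, 5500)]
counts = interaction_counts(variants, bin_size=1000)
hottest = max(counts.values())
colors = {cell: heat_color(n, hottest) for cell, n in counts.items()}
```

A larger bin size merges more variants into the same cell and therefore produces fewer but warmer areas.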

The functionality of the software is described in detail below, with an introductory screenshot of each part followed by explanatory text.


4.1 Common functions

Through the File menu, figure 3, the user accesses the central functions of the program. The menu contains the following functions:

• New CIRCOS initializes the circle diagram and prompts the user to select a VCF file and a TAB file.

• New coverage diagram starts the coverage diagram and prompts the user to select the TAB file.

• New karyogram initializes the karyotype diagram and prompts the user to select a VCF file and a TAB file, together with a file containing cytoband definitions.

• New heatmap starts the interaction heatmap and prompts the user to select a VCF file and a TAB file.

• Settings opens the settings window. Its appearance depends on which diagram is active. See 4.2, 4.3, 4.4 and 4.5.

• Export image exports the active view to an image file.

• Exit closes the program.

Figure 3: Common functions of the visualization tools. See also figure 3, appendix A.


4.2 Circle diagram

Figure 4: Screenshot of the circle diagram in its entirety. See also figure 4, appendix A.

Figure 4 shows an active circle diagram and information windows. Its central functions are marked and numbered, and are explained in order below.

1. The work bar at the top of the window is used for navigation and various functions:

• The Chromosomes button opens the Chromosome info window, see 3.

• If the user has changed any settings, the diagram can be updated with the Update CIRCOS button.

• The coverage diagram can be switched on and off with the Toggle Coverage button.

• The user can add images of their own to the diagram with Add Image to plot.

• To color specific regions on each chromosome, a TAB file can be loaded with the Import a color TAB button.

2. Variant data is shown here as a table:

• The START column gives the start position of the variant on the chromosome, measured in base pairs.

• The second column, ALT, describes the type of the variant, e.g. duplication or translocation.

• The END column gives the end position of the variant on the chromosome, measured in base pairs.

• The last column, GENE(S), lists the genes that are affected in some way by the variant.


3. General information about the chromosomes and the diagram is shown here as a table:

• The Name column gives the name of the chromosome, i.e. 1-22, X or Y.

• The next column, Length, gives the length of the chromosome in base pairs.

• No. of variants gives the number of variants a chromosome contains.

• The Display column contains checkboxes indicating whether the chromosome is shown in the diagram.

• The last column, Draw connections, contains checkboxes indicating whether lines illustrating translocations are to be shown.

• The Toggle display button is used to show or hide a specific chromosome. The status is shown in the Display column.

• The View variants button opens a window with variant data, see 2.

• A user who wants to add an extra variant manually can do so with the Add variant button.

• The Toggle connections button switches connections between chromosomes on and off. The status is shown in the Draw connections column.

4. Clicking File and then Settings in the menu bar opens a settings window:

• The resolution of the coverage diagram is set with BP Resolution (kb).

• Use log2 of coverage determines whether log2 of the values is used in the coverage diagram.

• The lowest and highest values (in percent) for the coverage diagram can be set with Min.coverage value (%) and Max.coverage value (%).

• The thickness of the connections between the chromosomes in the diagram is set with Width of connections.

• Show chromosome name determines whether the names of the chromosomes are shown in the diagram.

5. The connections are color-coded relative to one another. If several connections go to the same region of a chromosome, the connections are given a darker color the closer they come to the chromosome.

6. The coverage diagram is shown here as a smaller circle inside the outer one, where each point has an averaged coverage value belonging to the corresponding position on the underlying chromosome. A higher resolution setting gives an average over a larger region. Red means that the coverage for this region of the chromosome is low compared with the reference genome. Black means that the coverage is close to the mean coverage, while blue would mean high coverage.

7. Chromosomes can be selected by left-clicking. A subsequent right-click lets the user choose their colors with a color editor.


Right-clicking without first selecting any chromosomes opens a menu for either inserting text into the image or creating a colored label containing text. Texts, labels and images can be moved around in the scene. Hovering the mouse over a chromosome shows a tooltip with the chromosome's name, length and number of variants.
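The coverage settings described in items 4 and 6 above (resolution, percentage limits and log2) can be illustrated with a small sketch. How the tool combines these settings internally is not described here, so the following is only one plausible interpretation: values are averaged over larger bins, expressed as a percentage of the mean, clamped to the chosen limits and optionally converted to a log2 ratio. The function names and default limits are hypothetical.

```python
import math


def average_bins(values, factor):
    """Average consecutive groups of `factor` coverage values."""
    return [
        sum(values[i:i + factor]) / len(values[i:i + factor])
        for i in range(0, len(values), factor)
    ]


def to_display_values(values, use_log2=False, min_pct=0.0, max_pct=200.0):
    """Express values as a percentage of their mean, clamped to a range.

    With `use_log2`, return log2 of the ratio to the mean instead, so that
    0 means average coverage, -1 half of it and +1 double.
    """
    mean = sum(values) / len(values)
    result = []
    for value in values:
        pct = min(max(100.0 * value / mean, min_pct), max_pct)
        if use_log2:
            # Guard against log2(0) for bins that have no coverage at all.
            result.append(math.log2(pct / 100.0) if pct > 0 else float("-inf"))
        else:
            result.append(pct)
    return result


# Made-up coverage for six 1 kb bins, re-binned to 2 kb resolution.
coverage_1kb = [30.0, 30.0, 10.0, 20.0, 60.0, 30.0]
coverage_2kb = average_bins(coverage_1kb, factor=2)
percent = to_display_values(coverage_2kb)
log_ratio = to_display_values(coverage_2kb, use_log2=True)
```

On a log2 scale, halved and doubled coverage lie equally far from zero, which makes it easier to compare losses and gains by eye.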

4.3 Coverage diagram

Figure 5: Screenshot showing the coverage diagram in its entirety. See also figure 5, appendix A.

An active coverage diagram is shown in figure 5. Its central functions are marked and numbered, and are explained in order below.

1. The work bar at the top of the window is used for navigation and various functions:

• The Add subplot button adds a diagram to the window, see 3.

• Update layout updates how the diagrams are arranged.

2. Clicking File and then Settings in the upper left corner of the main window opens the settings window:

• The resolution of the diagrams is set with BP Resolution (kb).

• The lowest and highest values (in percent) for the coverage diagram can be set with Min.coverage value (%) and Max.coverage value (%).

• Number of columns determines the number of diagrams per row in the main window.

3. Here the user can choose how the coverage diagram is to be drawn, as a line chart or a scatter plot, and which chromosome to display.

4. Each diagram has its own toolbar. The toolbar contains, among other things, tools for zooming in and out of the diagram and for navigating along the x and y axes. The dimensions of the diagram itself can also be changed. The data for the diagram's axes can be changed as well, e.g. whether the scale is logarithmic or linear. The title of the diagram and the names of the axes are also editable. It is also possible to save an individual diagram as an image file.

5. Values above 125% of the mean coverage are colored red and values below 75% of the mean coverage are colored green, while the rest are colored black.

6. Pressing Control and right-clicking on a diagram opens a menu for removing that diagram and for inserting a text box.
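The color rule in item 5 can be written as a short function. The thresholds of 75% and 125% of the mean coverage are taken from the description above, while the function name and the sample values are made up for illustration.

```python
def coverage_color(value, mean, low=0.75, high=1.25):
    """Return the point color for one coverage value."""
    if value > high * mean:
        return "red"
    if value < low * mean:
        return "green"
    return "black"


values = [30.0, 10.0, 45.0, 31.0]  # made-up coverage values
mean = sum(values) / len(values)   # 29.0
colors = [coverage_color(v, mean) for v in values]
```

The returned names are valid Matplotlib color strings, so a list like `colors` could be passed as the `c` argument when drawing a scatter plot.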

4.4 Karyotype diagram

Figure 6: Screenshot showing the karyotype diagram in its entirety. See also figure 6, appendix A.

Figure 6 shows an active karyotype diagram and information windows. Its central functions are marked and numbered, and are explained in order below.

1. The work bar at the top of the window is used for navigation and for the following functions:

• The Chromosomes button opens the Chromosome info window, see 5.

• If the user has changed any settings, the diagram can be updated with the Update karyogram button.

2. The karyogram itself is shown here, with the chromosomes lined up vertically in sequence, and has the following functions:

• The color of the chromosomes' cytobands can be changed, see 4.

• The cytoband names can be switched on and off, see 5.


• Interchromosomal interactions are shown as lines between different cytobands on the chromosomes and can be toggled on and off, see 5.

• The chromosomes can be moved around in the main window by dragging them with the left mouse button held down.

• Hovering the mouse over a chromosome's cytoband shows a tooltip with the band's name and length.

3. Variant data is shown here as a table:

• The START column gives the start position of the variant on the chromosome, measured in base pairs.

• The second column, ALT, describes the type of the variant, e.g. duplication, translocation, etc.

• The END column gives the end position of the variant on the chromosome, measured in base pairs.

• GENE(S) lists the genes that are affected in some way by the variant.

• The last column, CYTOBAND, shows the start and end bands within which the variant lies.

4. The bands have standardized colors, but these can be changed in the settings window according to the user's preference.

5. General information about the chromosomes and the diagram is shown here as a table:

• The Name column gives the chromosome's name, i.e. 1-22 or X, Y.

• The next column, Length, gives the length of the chromosome measured in base pairs.

• No. of variants gives the number of variants a chromosome contains.

• The Display column contains checkboxes indicating whether the chromosome is shown in the diagram.

• The Draw connections column contains checkboxes indicating whether lines illustrating translocations should be drawn.

• The last column, Cyto band names, contains checkboxes indicating whether the chromosome's band names should be shown.

• The Toggle display button is used to show or hide a specific chromosome. The status is shown in the Display column.

• The View variants button opens a window with variant data, see 3.

• The Toggle connections button is used to turn connections between chromosomes on and off. The status is shown in the Draw connections column.

• Chromosome band names can be turned on and off with the Toggle cyto band names button. The status is shown in the Cyto band names column.

6. Right-clicking anywhere in the main window opens options for inserting text boxes and text labels, as for the circos diagram.
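
One row of the variant table in item 3 could be modelled roughly as follows. This is only a sketch: the class name `VariantRow`, its field names, and the sample values are assumptions for illustration and are not taken from the tool's source code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VariantRow:
    """One row of the variant table (hypothetical structure)."""
    start: int                                      # START: start position in base pairs
    alt: str                                        # ALT: variant type, e.g. a duplication
    end: int                                        # END: end position in base pairs
    genes: List[str] = field(default_factory=list)  # GENE(S): affected genes
    cytoband: str = ""                              # CYTOBAND: start and end band

    def length(self) -> int:
        """Extent of the variant in base pairs."""
        return self.end - self.start


# Made-up example row
row = VariantRow(start=1_000_000, alt="DUP", end=1_001_000,
                 genes=["GENE_A"], cytoband="p36.33")
```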


4.5 Interaction heatmap

Figure 7: Screenshot showing the interaction heatmap in its entirety. See also Figure 7, Appendix A.

Figure 7 shows an interaction heatmap. Its central functions are marked and numbered, and are explained in order below.

1. The toolbar at the top of the window is used for navigation and various functions:

• The Add heatmap button adds an interaction heatmap to the window, see 2 and 3.

• Update layout updates how the diagrams are arranged.

2. The graph consists of a matrix whose rows and columns represent positions in base pairs on the chromosomes. The elements of the matrix are interaction counts. A high count gives a red color while a low count gives blue, see 5.

3. The chromosomes to be compared against each other are selected here:

• If the same chromosome is selected for both axes, a variant type must be selected, see 4. If different chromosomes are selected for the axes, the variant type is assumed to be a translocation.

• Bin size (kb) divides each chromosome into pieces and thereby determines the resolution of the graph.

4. The variant type to be compared is selected here. The variant types are those specified in the VCF file.

5. The color coding for the matrix elements is visualized here in the form of a rectangle. The color coding changes dynamically: the highest interaction count in the matrix is always colored red and the lowest count dark blue.

6. Control + right-click on a diagram opens a menu for removing that diagram.
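
The matrix in item 2 and the dynamic color scale in item 5 can be sketched as follows. This is an illustrative sketch rather than the tool's implementation: the function names, the representation of an interaction as a pair of positions, and the linear scaling to the range 0 to 1 are all assumptions.

```python
def build_interaction_matrix(pairs, len_x, len_y, bin_size):
    """Count interactions per bin.

    pairs: iterable of (pos_x, pos_y) positions in base pairs,
           assumed to be 0-based and smaller than the chromosome length.
    len_x, len_y: chromosome lengths in base pairs.
    bin_size: bin width in base pairs.
    """
    n_x = -(-len_x // bin_size)  # ceiling division
    n_y = -(-len_y // bin_size)
    matrix = [[0] * n_x for _ in range(n_y)]
    for pos_x, pos_y in pairs:
        matrix[pos_y // bin_size][pos_x // bin_size] += 1
    return matrix


def normalize(matrix):
    """Scale counts linearly so the largest count becomes 1.0 (red end of
    the scale) and the smallest becomes 0.0 (dark blue end)."""
    flat = [count for row in matrix for count in row]
    lowest, highest = min(flat), max(flat)
    if highest == lowest:
        return [[0.0 for _ in row] for row in matrix]
    return [[(count - lowest) / (highest - lowest) for count in row]
            for row in matrix]


# Made-up interactions on two short "chromosomes"
pairs = [(1_500, 200), (1_700, 900), (4_100, 3_000)]
counts = build_interaction_matrix(pairs, len_x=5_000, len_y=4_000, bin_size=1_000)
scaled = normalize(counts)
```

Because the scale is recomputed from the current minimum and maximum, adding or removing interactions changes the color of every cell, which matches the dynamic behavior described in item 5.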


5 Discussion

Whole genome sequencing is a rapidly growing technology that can give important insights into the makeup of humans and their diseases. The tools developed in this project can be likened to a cog in the overall process of genetic analysis based on whole genome sequencing. They offer a broad set of functions that tie together different ways of visualizing data from VCF files and their associated TAB files. Discussions with the client, together with demonstrations of their existing tools and workflows, gave a deeper insight into how continued development of the software can integrate better with the client's in-house tools.

When the four developed tools were tested and demonstrated for the client and end users during follow-up meetings, a number of requests for additional functions and modifications came up, in addition to plain improvements to the software. There is thus plenty of room for further development; thoughts and ideas on this can be found in 5.1.

The developed tools do not have as many functions, nor are they as comprehensive, as existing software such as IGV and similar. On the other hand, the simple way in which the different diagrams are generated is an advantage for the many users who are not used to or comfortable with working at a command line and with complicated configuration files. At the same time, this limits the flexibility of the developed software and places greater weight on the loading and handling of varied data, one of the points that can be developed further.

An advantage of having four tools in one and the same program is that the strengths and weaknesses of the tools complement each other. The karyotype diagram's connections between chromosomes easily become unclear when lines overlap each other and cross over chromosomes. In that case it may be better to use the circos diagram, where the connections bend in towards the empty space in the middle of the circle and do not obscure important information. On the other hand, the circos diagram lacks the characteristic bands on the chromosomes, which the karyotype diagram has. Similarly, the coverage diagram provides more detailed information about coverage but lacks a way to show connections between chromosomes. The interaction heatmap clearly shows connections between chromosomes but can only handle two chromosomes at a time.

Python is a dynamic and flexible yet powerful programming language that lets the user develop functions more quickly than many other languages. Since requests for new functions and modifications were to be expected, and since Python is one of the most widely used languages in bioinformatics, it was a natural choice. matplotlib was chosen for the coverage diagram and the interaction heatmap because it is a powerful library that integrates well with Qt. Several other libraries offer similar functionality and also integrate with Qt, but matplotlib's interface is very similar to MATLAB's, so end users who often use MATLAB are already familiar with it.

Ethical questions often arise when handling genetic analyses. Data from a genome can say an enormous amount about a person today, and very likely even more in the near future. Information about inherited traits, genetic abnormalities, predisposition to genetic diseases and the like raises questions about personal privacy and about who should be able to access this data. In recent years there has been much discussion about the risks of genetic discrimination and about how, among others, insurance companies or employers can be prevented from gaining access to information about people's genomes. It should be pointed out, however, that the software developed during the project only facilitates analysis of already existing data, and does not in itself affect the handling and confidentiality of that data.

5.1 Further development

5.1.1 Circos diagram

One function not yet implemented is positioning of gene labels. When a gene label is created, it should be possible to specify a position on a selected chromosome and thereby anchor the label at that position, e.g. via lines. At present the labels can only be moved manually with the left mouse button and have no lines to chromosomes. Being able to distinguish connections by variant type, for example by coloring them differently, and to filter connections by variant type, would also be a very useful function. When an extra variant is added, the underlying data must be updated.
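
The proposed filtering and coloring of connections by variant type could look roughly like this. The sketch is hypothetical throughout: the dictionary layout of a connection, the type names, the color choices, and the function names are assumptions, since this feature is not implemented in the tool.

```python
# Hypothetical color per variant type; the type names are examples only
TYPE_COLORS = {"DUP": "blue", "DEL": "red", "BND": "green"}


def filter_connections(connections, wanted_types):
    """Return the connections whose variant type is in wanted_types."""
    return [c for c in connections if c["type"] in wanted_types]


def connection_color(connection, default="black"):
    """Look up the display color for a connection's variant type."""
    return TYPE_COLORS.get(connection["type"], default)


# Made-up connections between chromosomes, for illustration
connections = [
    {"from": "1", "to": "5", "type": "BND"},
    {"from": "2", "to": "2", "type": "DUP"},
    {"from": "X", "to": "X", "type": "INV"},
]
visible = filter_connections(connections, {"BND", "DUP"})
colors = [connection_color(c) for c in connections]
```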

5.1.2 Coverage diagram

To make it clearer that the coverage diagram represents a chromosome, a horizontal chromosome, like those in the karyotype diagram, could be inserted below the diagram. A pointer on this chromosome could indicate the interval currently shown in the coverage diagram.

5.1.3 Karyotype diagram

The variants belonging to a given chromosome are currently available in a table, see item 3 in the results section for the karyotype diagram. A further development could be to draw selected variants as bands on the chromosome, in the same way as the cytobands but in a different color. This would mark more clearly the extent of variants along a chromosome, perhaps those that are especially important in a clinical context. Some cytobands contain several variants, and giving connections from such bands a more intense color could give quick information about how many variants are found in each region. Being able to filter and distinguish connections by color would be useful here as well.

5.1.4 Interaction heatmap

An interesting function would be the ability to zoom in on the heatmap. At present many interactions lie on a line in the middle of the window. This is because the size of the variants is often on the order of 1000 base pairs. Relative to the heatmap's axes, which are millions of base pairs long, it looks as though a variant has the same start and end position. Reducing the range of the axes by zooming in would spread the interactions out more on the heatmap, since the start and end positions would then end up further apart, giving the user a more interesting view.
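
The effect described above can be checked with a little arithmetic. In this sketch the bin sizes and the variant coordinates are made-up illustrative numbers, and `bin_index` is a hypothetical helper rather than a function from the tool.

```python
def bin_index(position, axis_start, bin_size):
    """Index of the bin that a base-pair position falls into."""
    return (position - axis_start) // bin_size


# A variant of about 1000 base pairs (illustrative coordinates)
start, end = 50_000_200, 50_001_200

# Whole-chromosome view: axis starts at 0 and uses 1 Mb bins
wide = (bin_index(start, 0, 1_000_000), bin_index(end, 0, 1_000_000))

# Zoomed view: axis starts at 50 Mb and uses 100 bp bins
zoomed = (bin_index(start, 50_000_000, 100), bin_index(end, 50_000_000, 100))
```

In the wide view both positions fall in the same bin, so the interaction lands on the diagonal; in the zoomed view they fall ten bins apart and move off it.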

5.1.5 General

At present only VCF files with a certain specific format can be used. This is due to the program's logic for reading data. An extension could therefore be to develop this logic and make it more flexible so that it can also handle other formats.
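
As a possible starting point for more flexible reading, the fixed columns of a VCF data line and its INFO field can be split generically. This is a minimal sketch and not the tool's reader: the function name `parse_vcf_line` and the sample line are assumptions, while the column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) follows the VCF specification.

```python
def parse_vcf_line(line):
    """Parse one tab-separated VCF data line into a dict.

    Returns None for header lines (those starting with '#').
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]
    info_dict = {}
    for entry in info.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        # Flags without a value (no '=') are stored as True
        info_dict[key] = value if sep else True
    return {"chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
            "alt": alt, "qual": qual, "filter": filt, "info": info_dict}


# Made-up example line for illustration
line = "1\t1000000\t.\tN\t<DUP>\t.\tPASS\tSVTYPE=DUP;END=1001000;IMPRECISE"
record = parse_vcf_line(line)
```

Keeping the INFO field as a generic key-value mapping means that files from different variant callers can be read without hard-coding which keys must be present.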

The ability to have more than one tool open at the same time would also be of interest. Since the circos diagram gives an overall picture of the sequencing data, a heatmap, for example, can give complementary information about the position of variants on a particular chromosome. It would therefore be convenient to have both windows open side by side for comparison, instead of having to switch between the tools or run several instances of the program at once.

One way to improve portability and ease integration with the client's existing tools would be some network functionality, since some of these tools are designed to run on a server and present data via a web browser. A user could then find a region of interest in the genome in one of these tools and choose to visualize that region with one of the developed tools, by sending commands over the network that generate images which are then shown in the web browser.


6 Conclusions

The four requested tools are fully functional, tested on several platforms, and can be used to visualize data generated from whole genome sequencing in different ways. There is still much room for improvement of the developed software. The tools could be integrated better with other existing software, and a number of desirable functions remain to be implemented, but on the whole the tools form a good base to continue working on in the future.


7 References

Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., McVean, G. and Durbin, R. (2011) ‘The variant call format and VCFtools’, Bioinformatics, 27(15), pp. 2156–2158.

Doniger et al. (2008) ‘A Catalog of Neutral and Deleterious Polymorphism in Yeast’, PLoS Genetics, 4(8), e1000183.

Feuk, L., Carson, A.R. and Scherer, S.W. (2006) ‘Structural variation in the human genome’, Nature Reviews Genetics, 7(2), pp. 85–97.

Hofmeister, W., Nilsson, D., Topa, A., Anderlid, B.., Darki, F., Matsson, H., Tapia Paez, I., Klingberg, T., Samuelsson, L., Wirta, V., Vezzi, F., Kere, J., Nordenskjold, M., Syk Lundberg, E. and Lindstrand, A. (2014) ‘CTNND2–a candidate gene for reading problems and mild intellectual disability’, Journal of Medical Genetics, 52(2), pp. 111–122.

Schmutz, J. et al. (2004) ‘Quality assessment of the human genome sequence’, Nature, 429(6990), pp. 365–368.

Juan, L. et al. (2014) ‘The personal genome browser: visualizing functions of genetic variants’, Nucleic Acids Research, 42(Web Server issue), pp. W192–197.

Krzywinski, M. et al. (2009) ‘Circos: An information aesthetic for comparative genomics’, Genome Research, 19(9), pp. 1639–1645.

Morgan, G.J., Walker, B.A. and Davies, F.E. (2012) ‘The genetic architecture of multiple myeloma’, Nature Reviews Cancer, 12(5), pp. 335–348.

O’Connor, C. (2008) ‘Karyotyping for chromosomal abnormalities’, Nature Education, 1(1):27.

Thorvaldsdottir, H., Robinson, J.T. and Mesirov, J.P. (2013) ‘Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration’, Briefings in Bioinformatics, 14(2), pp. 178–192.

Zhao, Z., Liao, L., Cao, Y., Jiang, X. and Zhao, R.C. (2005) ‘Establishment and properties of fetal dermis-derived mesenchymal stem cell lines: Plasticity in vitro and hematopoietic protection in vivo’, Bone Marrow Transplantation, 36(4), pp. 355–365.


Appendix A Figures

Figure 1: Example of a circos diagram (Morgan, Walker, and Davies, 2012, p. 342)


Figure 2: Karyotype (Zhao et al., 2005, p. 360)


Figure 3: Common functions of the visualization tools


Figure 4: Screenshot of the circos diagram in its entirety.


Figure 5: Screenshot showing the coverage diagram in its entirety.


Figure 6: Screenshot showing the karyogram in its entirety.


Figure 7: Screenshot showing the interaction heatmap in its entirety.


Appendix B Source code

The complete source code can be found here: http://bitbucket.org/Ralars/kex-arbete/src


Graphics.py

import sys
import data
import circos
import coverage
import karyogram
import heatmap
from PySide.QtCore import *
from PySide.QtGui import *

#The main window of the program. Handles a central view widget, and menus and toolbars.
class WGSView(QMainWindow):

    def __init__(self):
        super().__init__()
        self.initmainwin()
        self.activeScene = False

    def initmainwin(self):
        self.setWindowTitle('WGS')
        self.resize(800,600)
        #Center the main window on the user's screen
        frameGeo = self.frameGeometry()
        desktopCenter = QDesktopWidget().availableGeometry().center()
        frameGeo.moveCenter(desktopCenter)
        self.move(frameGeo.topLeft())
        #Adds a status bar to the main window
        self.statusBar()
        #Create actions for menus and toolbars, connect to functions
        self.newCircAct = QAction('New CIRCOS',self)
        self.newCircAct.triggered.connect(self.newCirc)
        self.newCovDiagramAct = QAction('New coverage diagram',self)
        self.newCovDiagramAct.triggered.connect(self.newCovDiagram)
        self.newKaryogramAct = QAction('New karyogram',self)
        self.newKaryogramAct.triggered.connect(self.newKaryogram)
        self.newHeatmapAct = QAction('New heatmap',self)
        self.newHeatmapAct.triggered.connect(self.newHeatmap)
        self.exitAct = QAction('&Exit',self)
        self.exitAct.triggered.connect(self.close)
        #Create menus, toolbar, and add actions
        self.menubar = self.menuBar()
        self.fileMenu = self.menubar.addMenu('&File')
        self.fileMenu.addAction(self.newCircAct)
        self.fileMenu.addAction(self.newCovDiagramAct)
        self.fileMenu.addAction(self.newKaryogramAct)
        self.fileMenu.addAction(self.newHeatmapAct)
        self.fileMenu.addAction(self.exitAct)
        self.show()


    #Creates and initializes a new circos diagram
    def newCirc(self):
        startNew = True
        if self.activeScene:
            newSceneDialog = QDialog()
            newSceneDialog.setWindowTitle("Are you sure?")
            okButton = QPushButton('Ok', newSceneDialog)
            okButton.clicked.connect(newSceneDialog.accept)
            cancelButton = QPushButton('Cancel', newSceneDialog)
            cancelButton.clicked.connect(newSceneDialog.reject)
            textLabel = QLabel("The current scene will be lost. Are you sure?")
            newSceneDialog.layout = QGridLayout(newSceneDialog)
            newSceneDialog.layout.addWidget(textLabel,0,0,1,2)
            newSceneDialog.layout.addWidget(okButton,1,0)
            newSceneDialog.layout.addWidget(cancelButton,1,1)
            choice = newSceneDialog.exec_()
            if choice == QDialog.Accepted:
                startNew = True
            else:
                startNew = False
        if startNew:
            #Remove old toolbars, menu items, clear up resources
            try:
                self.removeToolBar(self.tools)
                self.tools.hide()
                self.tools.destroy()
                self.view.destroy()
            except:
                pass
            self.fileMenu.clear()
            self.scene = circos.CircosScene(self)
            self.view = circos.CircosView(self.scene)
            self.setCentralWidget(self.view)
            self.viewSettingsAct = QAction('&Settings',self)
            self.viewSettingsAct.triggered.connect(self.view.viewSettings)
            self.exportImageAct = QAction('Export image',self)
            self.exportImageAct.triggered.connect(self.exportImage)
            self.fileMenu.addAction(self.newCircAct)
            self.fileMenu.addAction(self.newCovDiagramAct)
            self.fileMenu.addAction(self.newKaryogramAct)
            self.fileMenu.addAction(self.newHeatmapAct)
            self.fileMenu.addAction(self.viewSettingsAct)
            self.fileMenu.addAction(self.exportImageAct)
            self.fileMenu.addAction(self.exitAct)
            self.showChInfoAct = QAction('&Chromosomes',self)
            self.showChInfoAct.triggered.connect(self.view.showChInfo)
            self.updateSceneAct = QAction('&Update CIRCOS',self)
            self.updateSceneAct.triggered.connect(self.updateScene)
            self.toggleCoverageAct = QAction('&Toggle coverage',self)


            self.toggleCoverageAct.triggered.connect(self.view.toggleCoverage)
            self.addImageAct = QAction('&Add Image to plot', self)
            self.addImageAct.triggered.connect(self.view.addImage)
            self.importColorTabAct = QAction('&Color regions with file', self)
            self.importColorTabAct.triggered.connect(self.view.importColorTab)
            self.tools = self.addToolBar('Chromosome tools')
            self.tools.addAction(self.showChInfoAct)
            self.tools.addAction(self.updateSceneAct)
            self.tools.addAction(self.toggleCoverageAct)
            self.tools.addAction(self.addImageAct)
            self.tools.addAction(self.importColorTabAct)
            #Some confusion with python bindings makes getOpenFileName return a tuple.
            #First element is the name of the file.
            self.statusBar().showMessage("Initializing new CIRCOS..")
            #Null string if cancel is pressed -- FIX: handle.
            tabFile = QFileDialog.getOpenFileName(None,"Specify TAB file",QDir.currentPath(), "TAB files (*.tab)")[0]
            vcfFile = QFileDialog.getOpenFileName(None,"Specify VCF file",QDir.currentPath(), "VCF files (*.vcf)")[0]
            self.statusBar().showMessage("Reading TAB..")
            self.reader = data.Reader()
            self.reader.readTab(tabFile)
            self.statusBar().showMessage("Reading VCF..")
            self.reader.readVCF(vcfFile)
            self.view.chromosomes = self.reader.returnChrList()
            self.view.numChr = len(self.view.chromosomes)
            self.view.coverageNormLog = self.reader.returnCoverageNormLog()
            self.view.coverageNorm = self.reader.returnCoverageNorm()
            self.view.tabName = self.reader.returnTabName()
            self.view.vcfName = self.reader.returnVcfName()
            self.view.addFileText()
            self.view.createChInfo()
            self.tools.show()
            self.view.chromosomeItems = []
            #Create a dict representing colors for the 24 default chromosomes
            self.view.chromoColors = {}
            color = self.view.startColor
            for i in range(24):
                self.view.chromoColors[self.view.chromosomes[i].name] = color
                color = color.darker(105)
            self.updateScene()
            self.activeScene = True
            self.show()
            self.view.showChInfo()

    #Method for updating the circos view
    def updateScene(self):
        self.statusBar().showMessage("Drawing..")
        self.view.initscene()
        self.statusBar().clearMessage()


    #Exports anything in the current view as a png image
    def exportImage(self):
        #Set default name to same as vcf file if this is loaded, otherwise use tab
        try:
            defaultPath = QDir.currentPath() + "/" + self.reader.returnVcfName()
            defaultPath = defaultPath.replace("vcf","png")
        except:
            defaultPath = QDir.currentPath() + "/" + self.reader.returnTabName()
            defaultPath = defaultPath.replace("tab","png")
        savePath = QFileDialog.getSaveFileName(self, "Export image", defaultPath, "Images (*.png)")[0]
        viewPixMap = QPixmap.grabWidget(self.view)
        viewPixMap.save(savePath)

    #Creates and initializes a new coverage diagram
    def newCovDiagram(self):
        startNew = True
        if self.activeScene:
            newSceneDialog = QDialog()
            newSceneDialog.setWindowTitle("Are you sure?")
            okButton = QPushButton('Ok', newSceneDialog)
            okButton.clicked.connect(newSceneDialog.accept)
            cancelButton = QPushButton('Cancel', newSceneDialog)
            cancelButton.clicked.connect(newSceneDialog.reject)
            textLabel = QLabel("The current scene will be lost. Are you sure?")
            newSceneDialog.layout = QGridLayout(newSceneDialog)
            newSceneDialog.layout.addWidget(textLabel,0,0,1,2)
            newSceneDialog.layout.addWidget(okButton,1,0)
            newSceneDialog.layout.addWidget(cancelButton,1,1)
            choice = newSceneDialog.exec_()
            if choice == QDialog.Accepted:
                startNew = True
            else:
                startNew = False
        if startNew:
            #Remove old toolbars, menu items, clear up resources
            try:
                self.removeToolBar(self.tools)
                self.tools.hide()
                self.tools.destroy()
                self.view.destroy()
            except:
                pass
            self.fileMenu.clear()
            self.statusBar().showMessage("Initializing new coverage diagram..")
            tabFile = QFileDialog.getOpenFileName(None,"Specify TAB file",QDir.currentPath(), "TAB files (*.tab)")[0]
            self.statusBar().showMessage("Reading TAB..")
            self.reader = data.Reader()
            self.reader.readTab(tabFile)


            self.statusBar().clearMessage()
            self.view = coverage.CoverageView(self.reader.chromosomes)
            self.scrollArea = QScrollArea(self)
            self.scrollArea.setWidget(self.view)
            self.scrollArea.setWidgetResizable(True)
            self.setCentralWidget(self.scrollArea)
            self.view.coverageNormLog = self.reader.returnCoverageNormLog()
            self.view.coverageNorm = self.reader.returnCoverageNorm()
            self.viewSettingsAct = QAction('&Settings',self)
            self.viewSettingsAct.triggered.connect(self.view.viewSettings)
            self.exportImageAct = QAction('Export image',self)
            self.exportImageAct.triggered.connect(self.exportImage)
            self.fileMenu.addAction(self.newCircAct)
            self.fileMenu.addAction(self.newCovDiagramAct)
            self.fileMenu.addAction(self.newKaryogramAct)
            self.fileMenu.addAction(self.newHeatmapAct)
            self.fileMenu.addAction(self.viewSettingsAct)
            self.fileMenu.addAction(self.exportImageAct)
            self.fileMenu.addAction(self.exitAct)
            self.tools = self.addToolBar('Coverage tools')
            self.addPlotAct = QAction('Add subplot', self)
            self.addPlotAct.triggered.connect(self.view.addChromoPlot)
            self.updateLayoutAct = QAction('Update layout', self)
            self.updateLayoutAct.triggered.connect(self.view.arrangePlots)
            self.tools.addAction(self.addPlotAct)
            self.tools.addAction(self.updateLayoutAct)
            self.tools.show()
            self.show()
            self.activeScene = True

    #Creates and initializes a new karyotype diagram
    def newKaryogram(self):
        startNew = True
        if self.activeScene:
            newSceneDialog = QDialog()
            newSceneDialog.setWindowTitle("Are you sure?")
            okButton = QPushButton('Ok', newSceneDialog)
            okButton.clicked.connect(newSceneDialog.accept)
            cancelButton = QPushButton('Cancel', newSceneDialog)
            cancelButton.clicked.connect(newSceneDialog.reject)
            textLabel = QLabel("The current scene will be lost. Are you sure?")
            newSceneDialog.layout = QGridLayout(newSceneDialog)
            newSceneDialog.layout.addWidget(textLabel,0,0,1,2)
            newSceneDialog.layout.addWidget(okButton,1,0)
            newSceneDialog.layout.addWidget(cancelButton,1,1)
            choice = newSceneDialog.exec_()
            if choice == QDialog.Accepted:
                startNew = True
            else:
                startNew = False
        if startNew:


            #Remove old toolbars, menu items, clear up resources
            try:
                self.removeToolBar(self.tools)
                self.tools.hide()
                self.tools.destroy()
                self.view.destroy()
            except:
                pass
            self.fileMenu.clear()
            self.statusBar().showMessage("Initializing karyogram..")
            #Null string if cancel is pressed -- FIX: handle.
            tabFile = QFileDialog.getOpenFileName(None,"Specify TAB file",QDir.currentPath(), "TAB files (*.tab)")[0]
            vcfFile = QFileDialog.getOpenFileName(None,"Specify VCF file",QDir.currentPath(), "VCF files (*.vcf)")[0]
            cytoFile = QFileDialog.getOpenFileName(None,"Specify cytoband file",QDir.currentPath(), "cytotab files (*.txt)")[0]
            self.statusBar().showMessage("Reading TAB..")
            self.reader = data.Reader()
            self.reader.readTab(tabFile)
            self.statusBar().showMessage("Reading VCF..")
            self.reader.readVCF(vcfFile)
            self.statusBar().showMessage("Reading cytoband file..")
            self.reader.readCytoTab(cytoFile)
            self.statusBar().clearMessage()
            self.view = karyogram.KaryogramView(self.reader.chromosomes, self.reader.returnCytoTab())
            self.view.createChInfo()
            self.setCentralWidget(self.view)
            self.viewSettingsAct = QAction('&Settings',self)
            self.viewSettingsAct.triggered.connect(self.view.viewSettings)
            self.exportImageAct = QAction('Export image',self)
            self.exportImageAct.triggered.connect(self.exportImage)
            self.fileMenu.addAction(self.newCircAct)
            self.fileMenu.addAction(self.newCovDiagramAct)
            self.fileMenu.addAction(self.newKaryogramAct)
            self.fileMenu.addAction(self.newHeatmapAct)
            self.fileMenu.addAction(self.viewSettingsAct)
            self.fileMenu.addAction(self.exportImageAct)
            self.fileMenu.addAction(self.exitAct)
            self.tools = self.addToolBar('Karyogram tools')
            self.updateKaryogramAct = QAction('Update karyogram', self)
            self.updateKaryogramAct.triggered.connect(self.view.updateItems)
            self.showChInfoAct = QAction('&Chromosomes',self)
            self.showChInfoAct.triggered.connect(self.view.showChInfo)
            self.tools.addAction(self.showChInfoAct)
            self.tools.addAction(self.updateKaryogramAct)
            self.tools.show()
            self.show()


            self.activeScene = True

    #Creates and initializes a new heatmap diagram
    def newHeatmap(self):
        startNew = True
        if self.activeScene:
            newSceneDialog = QDialog()
            newSceneDialog.setWindowTitle("Are you sure?")
            okButton = QPushButton('Ok', newSceneDialog)
            okButton.clicked.connect(newSceneDialog.accept)
            cancelButton = QPushButton('Cancel', newSceneDialog)
            cancelButton.clicked.connect(newSceneDialog.reject)
            textLabel = QLabel("The current scene will be lost. Are you sure?")
            newSceneDialog.layout = QGridLayout(newSceneDialog)
            newSceneDialog.layout.addWidget(textLabel,0,0,1,2)
            newSceneDialog.layout.addWidget(okButton,1,0)
            newSceneDialog.layout.addWidget(cancelButton,1,1)
            choice = newSceneDialog.exec_()
            if choice == QDialog.Accepted:
                startNew = True
            else:
                startNew = False
        if startNew:
            #Remove old toolbars, menu items, clear up resources
            try:
                self.removeToolBar(self.tools)
                self.tools.hide()
                self.tools.destroy()
                self.view.destroy()
            except:
                pass
            self.fileMenu.clear()
            self.statusBar().showMessage("Initializing new heatmap..")
            tabFile = QFileDialog.getOpenFileName(None,"Specify TAB file",QDir.currentPath(), "TAB files (*.tab)")[0]
            vcfFile = QFileDialog.getOpenFileName(None,"Specify VCF file",QDir.currentPath(), "VCF files (*.vcf)")[0]
            self.statusBar().showMessage("Reading TAB..")
            self.reader = data.Reader()
            self.reader.readTab(tabFile)
            self.statusBar().showMessage("Reading VCF..")
            self.reader.readVCF(vcfFile)
            self.statusBar().clearMessage()
            self.view = heatmap.HeatmapView(self.reader.chromosomes)
            self.scrollArea = QScrollArea(self)
            self.scrollArea.setWidget(self.view)
            self.scrollArea.setWidgetResizable(True)
            self.setCentralWidget(self.scrollArea)
            self.viewSettingsAct = QAction('&Settings',self)


            self.viewSettingsAct.triggered.connect(self.view.viewSettings)
            self.exportImageAct = QAction('Export image',self)
            self.exportImageAct.triggered.connect(self.exportImage)
            self.fileMenu.addAction(self.newCircAct)
            self.fileMenu.addAction(self.newCovDiagramAct)
            self.fileMenu.addAction(self.newKaryogramAct)
            self.fileMenu.addAction(self.newHeatmapAct)
            self.fileMenu.addAction(self.viewSettingsAct)
            self.fileMenu.addAction(self.exportImageAct)
            self.fileMenu.addAction(self.exitAct)
            self.tools = self.addToolBar('Coverage tools')
            self.addHeatmapAct = QAction('Add heatmap', self)
            self.addHeatmapAct.triggered.connect(self.view.addHeatmap)
            self.updateLayoutAct = QAction('Update layout', self)
            self.updateLayoutAct.triggered.connect(self.view.arrangePlots)
            self.tools.addAction(self.addHeatmapAct)
            self.tools.addAction(self.updateLayoutAct)
            self.tools.show()
            self.show()
            self.activeScene = True


Circos.py

import sys
import random
import math
import data
from PySide.QtCore import *
from PySide.QtGui import *

class CircosView(QGraphicsView):

    def __init__(self,scene):
        super().__init__(scene)
        self.setRenderHints(QPainter.Antialiasing)
        self.resize(800,600)
        self.show()
        self.chromosomes = []
        self.chromosomeItems = []
        self.coverageItems = []
        self.chromosome_connection_list = []
        self.regionItems = []
        self.numChr = 0
        self.bpWindow = 500
        self.useCoverageLog = True
        self.minCoverage = 0.5
        self.maxCoverage = 1.5
        self.startColor = QColor.fromRgb(243,241,172)
        self.numDispChromos = 23
        self.connWidth = 1
        self.showChrNames = True
        self.createSettings()

    def createSettings(self):
        self.settingsModel = QStandardItemModel()
        #create header labels to distinguish different settings.
        verticalHeaders = ["bpWindow", "useCoverageLog", "minCoverage", "maxCoverage", "connectionWidth"]
        self.settingsModel.setVerticalHeaderLabels(verticalHeaders)
        bpWinText = QStandardItem("BP Resolution (kb)")
        bpWinText.setEditable(False)
        bpWinText.setToolTip("No. of base pairs (x1000) used to average data in calculations.\nSmaller values may decrease performance.")
        bpWinData = QStandardItem()
        bpWinData.setData(self.bpWindow,0)
        bpWinData.setEditable(True)
        useCovLog = QStandardItem("Use log2 of coverage")
        useCovLog.setEditable(False)
        useCovLog.setToolTip("Use log2(value) for coverage values when displaying coverage graph?")


        useCovLogCheck = QStandardItem()
        useCovLogCheck.setCheckable(True)
        useCovLogCheck.setCheckState(Qt.Checked)
        useCovLogCheck.setEditable(False)
        minCovLimitText = QStandardItem("Min. coverage value (%)")
        minCovLimitText.setEditable(False)
        minCovLimitText.setToolTip("Minimum coverage value in coverage graph,\nin percentage of average coverage value of genome.")
        minCovLimitData = QStandardItem()
        minCovLimitData.setData(self.minCoverage*100,0)
        minCovLimitData.setEditable(True)
        maxCovLimitText = QStandardItem("Max. coverage value (%)")
        maxCovLimitText.setEditable(False)
        maxCovLimitText.setToolTip("Maximum coverage value in coverage graph,\nin percentage of average coverage value of genome.")
        maxCovLimitData = QStandardItem()
        maxCovLimitData.setData(self.maxCoverage*100,0)
        maxCovLimitData.setEditable(True)
        connPenWidthText = QStandardItem("Width of connections")
        connPenWidthText.setEditable(False)
        connPenWidthText.setToolTip("Set the size (in pixels) of the connection lines")
        connPenWidthData = QStandardItem()
        connPenWidthData.setData(self.connWidth,0)
        connPenWidthData.setEditable(True)
        showChrNameText = QStandardItem("Show chromosome names")
        showChrNameText.setEditable(False)
        showChrNameText.setToolTip("Show or hide the chromosome names on the circos diagram")
        showChrNameCheck = QStandardItem()
        showChrNameCheck.setCheckable(True)
        showChrNameCheck.setCheckState(Qt.Checked)
        showChrNameCheck.setEditable(False)
        self.settingsModel.setItem(0,0,bpWinText)
        self.settingsModel.setItem(0,1,bpWinData)
        self.settingsModel.setItem(1,0,useCovLog)
        self.settingsModel.setItem(1,1,useCovLogCheck)
        self.settingsModel.setItem(2,0,minCovLimitText)
        self.settingsModel.setItem(2,1,minCovLimitData)
        self.settingsModel.setItem(3,0,maxCovLimitText)
        self.settingsModel.setItem(3,1,maxCovLimitData)
        self.settingsModel.setItem(4,0,connPenWidthText)
        self.settingsModel.setItem(4,1,connPenWidthData)
        self.settingsModel.setItem(5,0,showChrNameText)
        self.settingsModel.setItem(5,1,showChrNameCheck)
        self.settingsModel.itemChanged.connect(self.updateSettings)

    def viewSettings(self):
        self.settingsList = QTableView()
        self.settingsList.setEditTriggers(QAbstractItemView.AllEditTriggers)
        self.settingsList.setShowGrid(False)
        self.settingsList.horizontalHeader().hide()
        self.settingsList.verticalHeader().hide()
        self.settingsList.setModel(self.settingsModel)


        self.settingsList.setTextElideMode(Qt.ElideNone)
        self.settingsDia = QDialog(self)
        self.settingsDia.setWindowTitle("Settings")
        applyButton = QPushButton('Apply', self.settingsDia)
        applyButton.clicked.connect(self.settingsDia.accept)
        self.settingsDia.layout = QGridLayout(self.settingsDia)
        self.settingsDia.layout.addWidget(self.settingsList,0,0,1,3)
        self.settingsDia.layout.addWidget(applyButton,1,0,1,1)
        self.settingsDia.show()

    def updateSettings(self,item):
        if item.row() == 0:
            self.bpWindow = item.data(0)
        if item.row() == 1:
            self.useCoverageLog = not self.useCoverageLog
        if item.row() == 2:
            self.minCoverage = item.data(0)/100
        if item.row() == 3:
            self.maxCoverage = item.data(0)/100
        if item.row() == 4:
            self.connWidth = item.data(0)
        if item.row() == 5:
            self.showChrNames = not self.showChrNames

    #Sums the end bp for every chromosome with display toggled on
    def returnTotalDisplayedBP(self):
        totalDispBP = 0
        for chromo in self.chromosomes:
            if chromo.display:
                totalDispBP += int(chromo.end)
        return totalDispBP

    #Creates data model for info window
    def createChInfo(self):
        self.chModel = QStandardItemModel()
        topstring = ["Name","Length","No. of variants","Display","Draw connections"]
        self.chModel.setHorizontalHeaderLabels(topstring)
        for chromo in self.chromosomes:
            infostring = [chromo.name,chromo.end,str(len(chromo.variants))]
            infoItems = [QStandardItem(string) for string in infostring]
            dispCheckItem = QStandardItem()
            dispCheckItem.setCheckable(False)
            connCheckItem = QStandardItem()
            connCheckItem.setCheckable(False)
            connCheckItem.setCheckState(Qt.Unchecked)
            checkList = [dispCheckItem, connCheckItem]
            infoItems.extend(checkList)
            #only keep chromosomes up to MT (no. 24), but toggle MT display off as default
            #do not add GLxxxx chr (no.25 and up)
            if (self.chromosomes.index(chromo) < 24):
                dispCheckItem.setCheckState(Qt.Checked)
                self.chModel.appendRow(infoItems)


            elif (self.chromosomes.index(chromo) == 24):
                dispCheckItem.setCheckState(Qt.Unchecked)
                chromo.display = False
                self.chModel.appendRow(infoItems)
            else:
                chromo.display = False

    #Creates a window with chromosomes and toggles, info
    def showChInfo(self):
        #if any earlier window is open, close it
        #(self.chDia does not exist yet the first time this is called)
        try:
            self.chDia.close()
        except (AttributeError, RuntimeError):
            pass
        self.chList = QTableView()
        self.chList.verticalHeader().hide()
        self.chList.setSelectionMode(QAbstractItemView.ExtendedSelection)
        self.chList.setSelectionBehavior(QAbstractItemView.SelectRows)
        self.chList.setEditTriggers(QAbstractItemView.NoEditTriggers)
        self.chList.setShowGrid(False)
        self.chList.setModel(self.chModel)
        self.chList.resizeColumnsToContents()
        #Give the length column some extra space..
        curWidth = self.chList.columnWidth(1)
        self.chList.setColumnWidth(1,curWidth+20)
        self.chDia = QDialog(self)
        self.chDia.setWindowTitle("Chromosome info")
        #Button for toggling display of selected chromosomes in the scene
        togButton = QPushButton('Toggle display', self.chDia)
        togButton.clicked.connect(self.toggleDisp)
        #Button for viewing selected chromosome variants
        viewVarButton = QPushButton('View variants', self.chDia)
        viewVarButton.clicked.connect(self.viewVariants)
        #Button for adding variants
        addVariantButton = QPushButton('Add variant', self.chDia)
        addVariantButton.clicked.connect(self.addVariant)
        #Button for toggling connections
        connButton = QPushButton('Toggle connections', self.chDia)
        connButton.clicked.connect(self.toggleConnections)
        self.chDia.layout = QGridLayout(self.chDia)
        self.chDia.layout.addWidget(self.chList,0,0,1,4)
        self.chDia.layout.addWidget(togButton,1,0,1,1)
        self.chDia.layout.addWidget(viewVarButton,1,1,1,1)
        self.chDia.layout.addWidget(addVariantButton,1,2,1,1)
        self.chDia.layout.addWidget(connButton,1,3,1,1)
        self.chDia.setMinimumSize(500,400)
        self.chDia.show()

    #Creates data model for variants in given chromosome
    def createVariantInfo(self, chromo):
        self.varModel = QStandardItemModel()
        topstring = ['START', 'ALT', 'END', 'GENE(S)']


        self.varModel.setHorizontalHeaderLabels(topstring)
        #Adding variant info to a list (except the info field, which has index=2 in the variant list)
        for variant in chromo.variants:
            infoitem = []
            infoitem.append(QStandardItem(variant[0]))
            infoitem.append(QStandardItem(variant[1]))
            infoitem.append(QStandardItem(variant[3]))
            infoitem.append(QStandardItem(variant[4]))
            self.varModel.appendRow(infoitem)

    #Creates a popup containing variant info in a table.
    #Could be implemented in a better way than multiple dialogues..
    def viewVariants(self):
        selectedIndexes = self.chList.selectedIndexes()
        selectedRows = [index.row() for index in selectedIndexes]
        selectedRows = set(selectedRows)
        for row in selectedRows:
            chromo = self.chromosomes[row]
            self.createVariantInfo(chromo)
            viewVarDia = QDialog(self)
            viewVarDia.setWindowTitle("Variants in contig " + chromo.name)
            varList = QTableView()
            varList.setMinimumSize(440,400)
            varList.verticalHeader().hide()
            varList.setEditTriggers(QAbstractItemView.NoEditTriggers)
            varList.setModel(self.varModel)
            varList.resizeColumnToContents(1)
            viewVarDia.layout = QGridLayout(viewVarDia)
            viewVarDia.layout.addWidget(varList,0,0)
            viewVarDia.show()

    def addVariant(self):
        #Adds a variant to selected chromosomes. Some models still have to be updated.
        #Not sure how to best handle input yet.
        selectedIndexes = self.chList.selectedIndexes()
        selectedRows = [index.row() for index in selectedIndexes]
        selectedRows = set(selectedRows)
        for row in selectedRows:
            chromo = self.chromosomes[row]
            addVariantDialog = QDialog()
            addVariantDialog.setWindowTitle("Add variant in contig " + chromo.name)
            applyButton = QPushButton('Ok', addVariantDialog)
            applyButton.clicked.connect(addVariantDialog.accept)
            cancelButton = QPushButton('Cancel', addVariantDialog)
            cancelButton.clicked.connect(addVariantDialog.reject)
            locBoxValidator = QIntValidator(self)
            locBoxValidator.setBottom(0)
            locABox = QLineEdit()
            locBBox = QLineEdit()
            locABox.setValidator(locBoxValidator)
            locBBox.setValidator(locBoxValidator)
            chromoBox = QComboBox()


            chromoStrings = [chromo.name for chromo in self.chromosomes if not "GL" in chromo.name]
            chromoBox.addItems(chromoStrings)
            altBox = QLineEdit()
            geneBox = QLineEdit()
            locALabel = QLabel("Position A: ")
            chromoLabel = QLabel("Chromosome B: ")
            locBLabel = QLabel("Position B: ")
            altLabel = QLabel("ALT: ")
            geneLabel = QLabel("GENE(S): ")
            addVariantDialog.layout = QGridLayout(addVariantDialog)
            addVariantDialog.layout.addWidget(locALabel,0,0)
            addVariantDialog.layout.addWidget(locABox,0,1)
            addVariantDialog.layout.addWidget(chromoLabel,1,0)
            addVariantDialog.layout.addWidget(chromoBox,1,1)
            addVariantDialog.layout.addWidget(locBLabel,2,0)
            addVariantDialog.layout.addWidget(locBBox,2,1)
            addVariantDialog.layout.addWidget(altLabel,3,0)
            addVariantDialog.layout.addWidget(altBox,3,1)
            addVariantDialog.layout.addWidget(geneLabel,4,0)
            addVariantDialog.layout.addWidget(geneBox,4,1)
            addVariantDialog.layout.addWidget(applyButton,5,0)
            addVariantDialog.layout.addWidget(cancelButton,5,1)
            choice = addVariantDialog.exec_()
            if choice == QDialog.Accepted:
                #END field should only be filled if chrB is the same
                if chromoBox.currentText() == chromo.name:
                    end = locBBox.text()
                else:
                    end = "."
                chromo.addVariant(locABox.text(),altBox.text(),"",end,geneBox.text(),"")

    def addImage(self):
        size = self.size()
        outerChrRect = QRect(QPoint(50,50), QPoint(size.height()-50,size.height()-50))
        fileName = QFileDialog.getOpenFileName(None,"Specify Image file",QDir.currentPath(), "Image files (*.png *.jpg *.bmp)")
        pixmap = QPixmap(fileName[0])
        #A null pixmap means the dialog was cancelled or the file could not be loaded
        if pixmap.isNull():
            print("is null")
            return
        #Scaling the pixmap to 70% of the circos-diagram size
        #pixmap = pixmap.scaled(outerChrRect.size()*0.7)
        pixmapItem = QGraphicsPixmapItem(pixmap)
        #Moving the image to the right of the circos-diagram
        pixmapItem.setPos(outerChrRect.center().x() + (outerChrRect.width()/2) + (outerChrRect.width()/10), outerChrRect.center().y() - (pixmapItem.boundingRect().height()/2))
        self.scene().addItem(pixmapItem)
        pixmapItem.setFlag(QGraphicsItem.ItemIsMovable)

    def toggleDisp(self):
        #The row associated with the item corresponds to a chromosome
        selectedIndexes = self.chList.selectedIndexes()
        selectedRows = [index.row() for index in selectedIndexes]


        #Convert to a set to get unique rows, since every column in the table is selected
        selectedRows = set(selectedRows)
        for row in selectedRows:
            dispConnItem = self.chModel.item(row,4)
            dispItem = self.chModel.item(row,3)
            if (dispItem.checkState() == Qt.Checked):
                dispItem.setCheckState(Qt.Unchecked)
                self.chromosomes[row].display = False
                dispConnItem.setCheckState(Qt.Unchecked)
                self.chromosomes[row].display_connections = False
                self.numDispChromos -= 1
            else:
                dispItem.setCheckState(Qt.Checked)
                self.chromosomes[row].display = True
                self.numDispChromos += 1
        self.initscene()

    def toggleConnections(self):
        selectedIndexes = self.chList.selectedIndexes()
        selectedRows = [index.row() for index in selectedIndexes]
        selectedRows = set(selectedRows)
        for row in selectedRows:
            dispConnItem = self.chModel.item(row,4)
            if self.chromosomes[row].display_connections:
                dispConnItem.setCheckState(Qt.Unchecked)
                self.chromosomes[row].display_connections = False
            else:
                dispConnItem.setCheckState(Qt.Checked)
                self.chromosomes[row].display_connections = True
        self.initscene()

    def toggleCoverage(self):
        if self.completeCoveragePathItem.isVisible():
            self.completeCoveragePathItem.hide()
        else:
            self.completeCoveragePathItem.show()

    #Method for defining or reinitializing the chromosome items.
    def makeItems(self):
        #To determine the length (therefore angle below) of a chromosome, let 360 deg represent
        #total number of bp to be displayed. The angle to increment for each chromosome
        #is then (chromosome.end / totalDispBP)*360. Cut off 1 deg for separation.
        size = self.size()
        outerChrRect = QRect(QPoint(50,50), QPoint(size.height()-50,size.height()-50))
        innerChrRect = QRect(QPoint(100,100),QPoint(size.height()-100,size.height()-100))
        curAngle = 0
        totalDispBP = self.returnTotalDisplayedBP()
        for chromo in self.chromosomes:
            if not chromo.display:
                continue
            angleIncr = (int(chromo.end) / totalDispBP) * 360
            #Define two painter paths constructing circle sectors


            outer = QPainterPath()
            inner = QPainterPath()
            outer.moveTo(outerChrRect.center())
            outer.arcTo(outerChrRect,-curAngle, -angleIncr+1)
            inner.moveTo(innerChrRect.center())
            inner.arcTo(innerChrRect,-curAngle, -angleIncr+1)
            #Saving the angles for later use, see drawConnections
            #(replaces the placeholder entry at this chromosome's index)
            angles = [curAngle, angleIncr]
            self.chromosome_angle_list[self.chromosomes.index(chromo)] = angles
            curAngle += angleIncr
            #Removes any leftover painting path that may cause ugly lines in the middle
            leftoverArea = QPainterPath()
            leftoverArea.moveTo(innerChrRect.center())
            leftoverArea.arcTo(innerChrRect,0,360)
            #Remove the inner circle sector from the outer sector to get the area to display
            chromoPath = outer.subtracted(inner)
            chromoPath = chromoPath.subtracted(leftoverArea)
            #Finally, construct a graphics item from the path, to be added to the scene
            if self.showChrNames:
                nameString = chromo.name
            else:
                nameString = ""
            chromoItem = ChromoGraphicItem(chromoPath, nameString)
            chromoItem.setToolTip(chromo.name + ": " + chromo.end + " bp, " + str(len(chromo.variants)) + " variants")
            #Look up the chromo name in the color dict for its defined color
            currentColor = self.chromoColors[chromo.name]
            chromoItem.setBrush(currentColor)
            #Add the finished graphics item to a list
            self.chromosomeItems.append(chromoItem)

    #Creates a coverage graph. FIX: maybe add bp delineation?
    def createCoverage(self):
        size = self.size()
        totalDispBP = self.returnTotalDisplayedBP()
        inRect = QRect(QPoint(150,150),QPoint(size.height()-150,size.height()-150))
        outRect = QRect(QPoint(125,125),QPoint(size.height()-125,size.height()-125))
        chrStartAngle = 0
        if self.useCoverageLog:
            normValue = self.coverageNormLog
        else:
            normValue = self.coverageNorm
        centerPoint = inRect.center()
        for chromo in self.chromosomes:
            if not chromo.display:
                continue
            chrEndAngle = (int(chromo.end) / totalDispBP) * 360 - 1
            innerPath = QPainterPath()
            innerPath.moveTo(centerPoint)
            outerPath = QPainterPath()
            outerPath.moveTo(centerPoint)


            #No. of coverage data items ranging from 249250 to 59373 -- far too much to draw..
            #sum a number of entries as specified in bpWindow and create an average
            if self.useCoverageLog:
                coverageChunks = [chromo.coverageLog[i:i+self.bpWindow] for i in range(0,len(chromo.coverageLog),self.bpWindow)]
            else:
                coverageChunks = [chromo.coverage[i:i+self.bpWindow] for i in range(0,len(chromo.coverage),self.bpWindow)]
            angleIncr = ((chrEndAngle) / len(coverageChunks))
            curAngle = chrStartAngle
            for chunk in coverageChunks:
                avgCoverage = sum(chunk) / len(chunk)
                #for chromosomes up to 22, 150% of norm is max and 50% is min (default).
                #find the tVal using linear interpolation between these two points
                if (avgCoverage > normValue*self.maxCoverage):
                    avgCoverage = normValue*self.maxCoverage
                if (avgCoverage < normValue*self.minCoverage):
                    avgCoverage = normValue*self.minCoverage
                tVal = (avgCoverage - normValue*self.minCoverage)/(normValue*self.maxCoverage - normValue*self.minCoverage)
                innerPath.arcMoveTo(inRect,-curAngle)
                outerPath.arcMoveTo(outRect,-curAngle)
                lineBetween = QLineF(outerPath.currentPosition(),innerPath.currentPosition())
                outerPath.moveTo(lineBetween.pointAt(0.5))
                outerPath.lineTo(lineBetween.pointAt(tVal))
                curAngle += angleIncr
            chrStartAngle += chrEndAngle + 1
            covItem = QGraphicsPathItem(outerPath)
            self.coverageItems.append(covItem)

    def drawConnections(self):
        #Loops through the full list of chromosomes and checks if the connections should be displayed or not
        size = self.size()
        outerChrRect = QRect(QPoint(50,50), QPoint(size.height()-50,size.height()-50))
        innerChrRect = QRect(QPoint(100,100),QPoint(size.height()-100,size.height()-100))
        counter = 0
        for index in range(len(self.chromosomes)):
            #skip unless both the chromosome and its connections are set to be displayed
            if not (self.chromosomes[index].display_connections and self.chromosomes[index].display):
                continue
            #only create the connection list if it has not been initialized earlier
            if not self.chromosomes[index].connections:
                self.chromosomes[index].createConnections()
            for connection in self.chromosomes[index].connections:
                #The information is stored as string elements and needs to be converted to integers
                if connection[1] == 'X':
                    ChrB=23
                elif connection[1] == 'Y':
                    ChrB=24
                elif connection[1].startswith('G'):
                    continue
                else:


                    ChrB = int(connection[1])
                if not self.chromosomes[ChrB-1].display:
                    continue
                #The curAngle determines where on the circle the chromosome is located (also used in makeItems)
                curAngle_A = self.chromosome_angle_list[index][0]
                curAngle_B = self.chromosome_angle_list[ChrB-1][0]
                #The windows of each variant (WINA, WINB) are used to determine where on the chromosome the interaction is located
                bp_End_A = int(connection[2].split(',')[1])
                ChrA_length = int(self.chromosomes[index].end)
                bp_End_B = int(connection[3].split(',')[1])
                ChrB_length = int(self.chromosomes[ChrB-1].end)
                #A percentage of the total angle (used to draw the chromosome in makeItems) determines where on the
                #chromosome the connection is located
                angleIncr_A = (1-((ChrA_length - bp_End_A) / ChrA_length)) * (self.chromosome_angle_list[index][1]-2)
                angleIncr_B = (1-((ChrB_length - bp_End_B) / ChrB_length)) * (self.chromosome_angle_list[ChrB-1][1]-2)
                #A Path is created to assign the position for the connections
                tempPath = QPainterPath()
                #The arcMoveTo() function is used to get the points on each chromosome where the connection is located
                tempPath.arcMoveTo(innerChrRect, - (curAngle_A + angleIncr_A))
                posA = tempPath.currentPosition()
                tempPath.arcMoveTo(innerChrRect, - (curAngle_B + angleIncr_B))
                posB = tempPath.currentPosition()
                centerPos = outerChrRect.center()
                #A Bezier curve is then created between these three points
                ConnectionPath = QPainterPath()
                ConnectionPath.moveTo(posA)
                ConnectionPath.quadTo(centerPos,posB)
                #The path is converted to a graphics path item
                ConnectionItem = QGraphicsPathItem(ConnectionPath)
                #The PathItem is given the color of chromosome B and a width (default is 1 pixel wide)
                pen = QPen(self.chromoColors[self.chromosomes[ChrB-1].name], self.connWidth)
                ConnectionItem.setPen(pen)
                #Creating a rectangle (1x1 pixels) around each posB point for use when heat mapping the connections
                rect = QRect(posB.toPoint(),QSize(1,1))
                rect.moveCenter(posB.toPoint())
                #self.scene().addItem(QGraphicsRectItem(rect))
                ConnectionInfo = [ConnectionItem, rect, posA, posB, (ChrB-1), counter]
                #The item is added to a list
                self.chromosomes[index].connection_list.append(ConnectionInfo)
                counter = counter + 1
        #Checking to see if any neighbouring connections are close to each other; if they are, create a color gradient
        #for both neighbouring connection lines that gets darker closer to the connection
        for index1 in range(len(self.chromosomes)):


            for connItem1 in self.chromosomes[index1].connection_list:
                for index2 in range(len(self.chromosomes)):
                    for connItem2 in self.chromosomes[index2].connection_list:
                        #check to see if one rectangle is comparing with itself
                        if connItem1[5] == connItem2[5]:
                            continue
                        if connItem1[1].intersects(connItem2[1]):
                            linearGrad = QLinearGradient(connItem1[2], connItem1[3])
                            linearGrad.setColorAt(0, self.chromoColors[self.chromosomes[connItem1[4]].name])
                            linearGrad.setColorAt(1, self.chromoColors[self.chromosomes[connItem1[4]].name].darker(300))
                            connItem1[0].setPen(QPen(QBrush(linearGrad), self.connWidth))
                            connItem2[0].setPen(QPen(QBrush(linearGrad), self.connWidth))

    #Imports either a tab file with specified regions to color, or a cytoband file
    def importColorTab(self):
        fileName = QFileDialog.getOpenFileName(None, "Specify a color tab-file", QDir.currentPath(), "tab-files (*.tab *.txt)")[0]
        reader = data.Reader()
        if fileName.endswith("tab"):
            reader.readColorTab(fileName)
            colorTab = reader.returnColorTab()
            self.colorRegions(colorTab,False)
        else:
            reader.readCytoTab(fileName)
            colorTab = reader.returnCytoTab()
            self.colorRegions(colorTab,True)

    def colorRegions(self,colorTab,cytoband):
        size = self.size()
        outerChrRect = QRect(QPoint(50,50), QPoint(size.height()-50,size.height()-50))
        innerChrRect = QRect(QPoint(100,100),QPoint(size.height()-100,size.height()-100))
        colors = {'red': Qt.red, 'magenta': Qt.magenta, 'blue': Qt.blue, 'cyan': Qt.cyan, 'yellow': Qt.yellow, 'darkBlue': Qt.darkBlue}
        stainColors = {'acen':Qt.darkRed, 'gneg':Qt.white,'gpos100':Qt.black,'gpos25':Qt.lightGray,'gpos50':Qt.gray,
                       'gpos75':Qt.darkGray,'gvar':Qt.white,'stalk':Qt.red}
        #Every item in colorTab, if not a cytoband file, contains 4 items: chromosome name, startPos, endPos, color
        #If a cytoband file, use the stain name to determine color
        self.regionItems = []
        for region in colorTab:
            #Find a matching chromosome item for every region and make sure it's displayed
            for chromo in self.chromosomes:
                if not chromo.display:
                    continue
                if chromo.name == region[0]:
                    #where on the circle does this chromosome start, how much does it span?
                    index = self.chromosomes.index(chromo)
                    startAngle = self.chromosome_angle_list[index][0]
                    angleSpan = self.chromosome_angle_list[index][1]
                    #the region starts and ends at certain points in this span


                    regionStart = int(region[1])
                    regionEnd = int(region[2])
                    #if the files are slightly misaligned, set maximum end to chromo end
                    if regionEnd > int(chromo.end):
                        regionEnd = int(chromo.end)
                    regionStartAngle = startAngle + (regionStart/int(chromo.end))*angleSpan
                    regionEndAngle = startAngle + (regionEnd/int(chromo.end))*angleSpan
                    #Define two painter paths constructing circle sectors
                    outer = QPainterPath()
                    inner = QPainterPath()
                    outer.moveTo(outerChrRect.center())
                    outer.arcTo(outerChrRect,-regionStartAngle, -(regionEndAngle-regionStartAngle))
                    inner.moveTo(innerChrRect.center())
                    inner.arcTo(innerChrRect,-regionStartAngle, -(regionEndAngle-regionStartAngle))
                    #Removes any leftover painting path that may cause ugly lines in the middle
                    leftoverArea = QPainterPath()
                    leftoverArea.moveTo(innerChrRect.center())
                    leftoverArea.arcTo(innerChrRect,0,360)
                    #Remove the inner circle sector from the outer sector to get the area to display
                    regionPath = outer.subtracted(inner)
                    regionPath = regionPath.subtracted(leftoverArea)
                    regionItem = QGraphicsPathItem(regionPath)
                    if cytoband:
                        regionColor = stainColors[region[4]]
                    else:
                        regionColor = colors[region[3]]
                    regionItem.setBrush(regionColor)
                    regionItem.setOpacity(1)
                    #Add the finished graphics item to a list
                    self.regionItems.append(regionItem)
        for regionItem in self.regionItems:
            self.scene().addItem(regionItem)

    def initscene(self):
        #Clear old chromosome items, coverage, connections
        #(the coverage item does not exist yet the first time the scene is initialized)
        try:
            self.scene().removeItem(self.completeCoveragePathItem)
        except (AttributeError, RuntimeError):
            pass
        for chrItem in self.chromosomeItems:
            #Update the color dict in case user modified these
            self.chromoColors[chrItem.nameString] = chrItem.brush().color()
            self.scene().removeItem(chrItem)
        for index in range(len(self.chromosomes)):
            for connItem in self.chromosomes[index].connection_list:
                self.scene().removeItem(connItem[0])
        for regionItem in self.regionItems:
            self.scene().removeItem(regionItem)
        self.scene().markedChromItems = []
        self.chromosomeItems = []
        self.coverageItems = []
        for index in range(len(self.chromosomes)):


            self.chromosomes[index].connection_list = []
        self.chromosome_angle_list = [None]*24
        #Create new graphics items, add these to the scene.
        self.makeItems()
        self.createCoverage()
        self.drawConnections()
        for chrItem in self.chromosomeItems:
            self.scene().addItem(chrItem)
        for index in range(len(self.chromosomes)):
            for connItem in self.chromosomes[index].connection_list:
                self.scene().addItem(connItem[0])
        #For more convenient coloring, create a new graphics item consisting of all coverages added together
        completeCoveragePath = QPainterPath()
        for covItem in self.coverageItems:
            completeCoveragePath.addPath(covItem.path())
        self.completeCoveragePathItem = QGraphicsPathItem(completeCoveragePath)
        #We then create a gradient with short interpolation distances, based on
        #the rectangles used for defining coverage items
        size = self.size()
        outRect = QRect(QPoint(125,125),QPoint(size.height()-125,size.height()-125))
        inRect = QRect(QPoint(150,150),QPoint(size.height()-150,size.height()-150))
        gradRadius = outRect.width()/2
        radialGrad = QRadialGradient(outRect.center(), gradRadius)
        diff = outRect.width()/2 - inRect.width()/2
        #In setColorAt, 0 is the circle center, 1 is the edge.
        #The coverage graph reaches from a radius of 1, to 1-diff/gradRadius, in these coordinates.
        #We use two stops for a color switch, placed around thirds of coverage graph reach.
        radialGrad.setColorAt(1,Qt.red)
        radialGrad.setColorAt(1-diff/gradRadius*(1/3.1),Qt.red)
        radialGrad.setColorAt(1-diff/gradRadius*(1/3),Qt.black)
        radialGrad.setColorAt(1-diff/gradRadius*(2/3),Qt.black)
        radialGrad.setColorAt(1-diff/gradRadius*(2.1/3),Qt.blue)
        #Create a pen with a brush using the gradient, tell the graphic item to use the pen, add to scene.
        covBrush = QBrush(radialGrad)
        covPen = QPen()
        covPen.setBrush(covBrush)
        self.completeCoveragePathItem.setPen(covPen)
        self.scene().addItem(self.completeCoveragePathItem)
        self.update()

    #Adds the VCF and TAB file names as text items to the top of the scene
    def addFileText(self):
        tabText = self.scene().addText("TAB File: " + self.tabName)
        tabText.setFlag(QGraphicsItem.ItemIsMovable)
        tabText.setTextInteractionFlags(Qt.TextEditorInteraction)
        vcfText = self.scene().addText("VCF File: " + self.vcfName)
        vcfText.setPos(0,0+tabText.boundingRect().height())
        vcfText.setFlag(QGraphicsItem.ItemIsMovable)
        vcfText.setTextInteractionFlags(Qt.TextEditorInteraction)

#Subclass of graphics path item for custom handling of mouse events


class ChromoGraphicItem(QGraphicsPathItem):

    def __init__(self,path,nameString):
        super().__init__(path)
        self.selected = False
        self.nameString = nameString
        self.setData(0,"CustomSelection")
        self.setPen(QPen(Qt.black,1))

    #Marks the chromosome item with a blue outline if selected
    def mark(self):
        currentPen = self.pen()
        currentPen.setStyle(Qt.DashLine)
        currentPen.setBrush(Qt.blue)
        currentPen.setWidth(3)
        self.setPen(currentPen)
        self.selected = True

    def unmark(self):
        self.setPen(QPen(Qt.black,1))
        self.selected = False

    #Paints the name of the chromosome in the middle of the item -- changing the font etc. could be implemented if needed
    def paint(self,painter,option,widget):
        super().paint(painter,option,widget)
        painter.drawText(self.path().boundingRect().center(),self.nameString)

#Subclass of graphics scene for custom handling of mouse events
class CircosScene(QGraphicsScene):

    def __init__(self, parent):
        super().__init__(parent)
        self.markedChromItems = []

    #Modified slightly for different selection behaviour (no default borders etc)
    def mousePressEvent(self,event):
        leftClickPos = event.buttonDownScenePos(Qt.LeftButton)
        clickedItems = self.items(leftClickPos)
        for item in clickedItems:
            if not item.isEnabled():
                continue
            #Items with custom selection behavior have custom data
            #These should have their own handling; other items go through the default implementation
            if (item.data(0) == "CustomSelection"):
                if item.selected:
                    item.unmark()
                    self.markedChromItems.remove(item)
                else:
                    item.mark()
                    self.markedChromItems.append(item)
            else:


                QGraphicsScene.mousePressEvent(self,event)

    #Opens a context menu on right click
    def contextMenuEvent(self,event):
        self.lastContextPos = event.scenePos()
        if self.markedChromItems:
            menu = QMenu()
            setColorAct = QAction('Set color of selected chromosomes',self)
            setColorAct.triggered.connect(self.setColor)
            menu.addAction(setColorAct)
            menu.exec_(QCursor.pos())
        else:
            menu = QMenu()
            addSceneTextAct = QAction('Insert text',self)
            addSceneTextAct.triggered.connect(self.addSceneText)
            addGeneLabelAct = QAction('Add gene label',self)
            addGeneLabelAct.triggered.connect(self.addGeneLabel)
            menu.addAction(addSceneTextAct)
            menu.addAction(addGeneLabelAct)
            menu.exec_(QCursor.pos())

    #Opens a color pick dialog, and sets chromosome item(s) to this color.
    #Several items can be marked, use color of first item as default in that case
    def setColor(self):
        if self.markedChromItems:
            initialColor = self.markedChromItems[0].brush().color()
            chosenColor = QColorDialog.getColor(initialColor)
            #An invalid color means the dialog was cancelled; keep colors and selection unchanged
            if not chosenColor.isValid():
                return
            for item in self.markedChromItems:
                item.setBrush(chosenColor)
                item.unmark()
            self.markedChromItems = []

    def addSceneText(self):
        (text, ok) = QInputDialog.getText(None, 'Insert text', 'Text:')
        if ok and text:
            textItem = QGraphicsTextItem(text)
            textItem.setPos(self.lastContextPos)
            textItem.setFlag(QGraphicsItem.ItemIsMovable)
            textItem.setTextInteractionFlags(Qt.TextEditorInteraction)
            self.addItem(textItem)

    def addGeneLabel(self):
        #Adds a label item, with user set chromosome, location, text, and color.
        #Currently only adds a graphic for the label, but should automatically draw a line from the item,
        #to specified chromosome and position. Needs input check..
        labelDialog = QDialog()
        labelDialog.setWindowTitle("Add label")
        applyButton = QPushButton('Ok', labelDialog)
        applyButton.clicked.connect(labelDialog.accept)
        chromoBox = QComboBox()
        chromoStrings = [chromo.name for chromo in self.views()[0].chromosomes if chromo.display]
        chromoBox.addItems(chromoStrings)


        locBox = QLineEdit()
        locBoxValidator = QIntValidator(self)
        locBoxValidator.setBottom(0)
        locBox.setValidator(locBoxValidator)
        textBox = QLineEdit()
        colorBox = QComboBox()
        colorStrings = QColor.colorNames()
        colorBox.addItems(colorStrings)
        chrLabel = QLabel("Add label for chromosome: ")
        locLabel = QLabel("Label location: ")
        geneLabel = QLabel("Label text: ")
        colorLabel = QLabel("Label color: ")
        labelDialog.layout = QGridLayout(labelDialog)
        labelDialog.layout.addWidget(chrLabel,0,0)
        labelDialog.layout.addWidget(chromoBox,0,1)
        labelDialog.layout.addWidget(locLabel,1,0)
        labelDialog.layout.addWidget(locBox,1,1)
        labelDialog.layout.addWidget(geneLabel,2,0)
        labelDialog.layout.addWidget(textBox,2,1)
        labelDialog.layout.addWidget(colorLabel,3,0)
        labelDialog.layout.addWidget(colorBox,3,1)
        labelDialog.layout.addWidget(applyButton,4,0)
        choice = labelDialog.exec_()
        if choice == QDialog.Accepted:
            textItem = QGraphicsTextItem(textBox.text())
            rectItem = QGraphicsRectItem(textItem.boundingRect())
            rectItem.setBrush(QColor(colorBox.currentText()))
            self.addItem(rectItem)
            self.addItem(textItem)
            labelItem = self.createItemGroup([rectItem,textItem])
            labelItem.setFlag(QGraphicsItem.ItemIsMovable)
            labelItem.setPos(self.lastContextPos)
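
The comment in addGeneLabel notes that the dialog still needs an input check. The QIntValidator only guarantees a non-negative integer, not that the location lies within the selected chromosome. A minimal sketch of such a check is given below. The function name validate_label_location is hypothetical and not part of the program; it assumes that the chromosome end is available as a string or integer, as stored by Chromosome.setEnd.

```python
def validate_label_location(text, chromo_end):
    """Return the position as an int if it lies within 0..chromo_end, else None.

    Hypothetical helper, not part of the original program.
    """
    try:
        position = int(text)
    except (TypeError, ValueError):
        # Empty or non-numeric input
        return None
    if 0 <= position <= int(chromo_end):
        return position
    return None


# Chromosome.end is stored as a string read from the TAB file, hence the int() conversion
print(validate_label_location('100', '200'))
print(validate_label_location('300', '200'))
```

In the dialog, a None result could be used to reject the input instead of creating the label.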


Data.py

import sys
import math

class Reader():

    def __init__(self):
        self.numChr = 0
        self.chromosomes = []
        self.totalBP = 0
        self.vcfInfoLines = []
        self.coverageNorm = 0
        self.coverageNormLog = 0
        self.colorTabInfo = []
        self.cytoTabInfo = []

    #Reads a tab file with name string given by toRead.
    #Constructs a list of chromosome items, one per chromosome, and inserts
    #chromosome name, start bp, end bp, coverage per 1000 bp in these items.
    def readTab(self,toRead):
        totalReadLines = 0
        self.tabFileName = toRead
        with open(toRead, 'r') as tab:
            #Read the first line in the file, should start with #CHR.
            line = tab.readline()
            if (not line.startswith("#CHR")):
                print('TAB file is not in correct format')
                return -1
            else:
                print('TAB file seems ok, continuing read')
            #Read the second line and create the first Chromosome object.
            #All following lines should be formatted as: chrName\tstart\tend\tcoverage
            line = tab.readline()
            fields = line.split('\t')
            if (not len(fields) == 4):
                print("TAB file not formatted correctly on line 2")
                return -1
            else:
                curChrName = fields[0]
                chrom = Chromosome(curChrName, fields[1])
                chrom.addCoverage(float(fields[3]))
                self.coverageNorm += float(fields[3])
                if(float(fields[3])) > 0:


                    self.coverageNormLog += math.log(float(fields[3]),2)
                totalReadLines += 1
                lastRead = line
                self.chromosomes.append(chrom)
                self.numChr += 1
            #Iterate over the rest of the lines in the file
            for line in tab:
                fields = line.split('\t')
                if (not len(fields) == 4):
                    print("TAB file not formatted correctly")
                    return -1
                #If we come across a new chromosome, assign end on current chromosome (contained in last read line)
                #Then create a new Chromosome object, assign name & start. Add to list.
                if fields[0] != curChrName:
                    chrom.setEnd(lastRead.split('\t')[2])
                    curChrName = fields[0]
                    chrom = Chromosome(curChrName, fields[1])
                    self.chromosomes.append(chrom)
                    self.numChr += 1
                #Every line contains coverage data of interest, for current chromosome
                chrom.addCoverage(float(fields[3]))
                self.coverageNorm += float(fields[3])
                if(float(fields[3])) > 0:
                    self.coverageNormLog += math.log(float(fields[3]),2)
                totalReadLines += 1
                #Store last read line and go to next line
                lastRead = line
            chrom.setEnd(lastRead.split('\t')[2])
            self.coverageNorm = self.coverageNorm / totalReadLines
            self.coverageNormLog = self.coverageNormLog / totalReadLines
        #DEBUG: print read data (and also sum total bp, move elsewhere later maybe)
        #print("%d chromosomes read: " % (self.numChr))
        for i in range(0,self.numChr):
            self.totalBP += int(self.chromosomes[i].end)
            curChr = self.chromosomes[i]
            #print("Name: %s\nStart: %s\nEnd: %s\n" % (curChr.name,curChr.start,curChr.end))
        #print("Total base pairs: %d" % (self.totalBP))
        return 1

    def returnTotalBP(self):
        return self.totalBP

    def returnChrList(self):
        return self.chromosomes

    def returnCoverageNorm(self):
        return self.coverageNorm


    def returnCoverageNormLog(self):
        return self.coverageNormLog

    def readColorTab(self, toRead):
        self.colorTabName = toRead
        with open(toRead, 'r') as tab:
            #Read the first line in the file, should start with #chromosome.
            line = tab.readline()
            if (not line.startswith("#chromosome")):
                print('TAB file is not in correct format')
                return -1
            else:
                print('TAB file seems ok, continuing read')
            #The fields are as following:
            #chromosome, startPos, endPos, color
            for line in tab:
                fields = line.split('\t')
                fields[3] = fields[3].strip('\n')
                colorTab = [fields[0], fields[1], fields[2], fields[3]]
                self.colorTabInfo.append(colorTab)

    def returnColorTab(self):
        return self.colorTabInfo

    def readCytoTab(self, toRead):
        self.cytoTabName = toRead
        with open(toRead, 'r') as tab:
            #Read the first line in the file. It has no header, so it should start with chr1.
            line = tab.readline()
            if (not line.startswith("chr1")):
                print('TAB file is not in correct format')
                return -1
            else:
                print('CytoBandTAB file seems ok, continuing read')
            #The fields are as following:
            #chromosome, startPos, endPos, cytoband, stain value
            #The first line is already a data line, so it is processed before the loop
            fields = line.split('\t')
            fields[0] = fields[0].strip('chr')
            fields[4] = fields[4].strip('\n')
            cytoTab = [fields[0], fields[1], fields[2], fields[3], fields[4]]
            self.cytoTabInfo.append(cytoTab)
            for line in tab:
                fields = line.split('\t')
                fields[0] = fields[0].strip('chr')
                fields[4] = fields[4].strip('\n')
                cytoTab = [fields[0], fields[1], fields[2], fields[3], fields[4]]
                self.cytoTabInfo.append(cytoTab)

    def returnCytoTab(self):
        return self.cytoTabInfo


    #Reads a vcf file with name string given by toRead.
    #Stores meta information lines and inserts variants located in a chromosome
    #into corresponding chromosome items created by reading a tab file.
    #Note that the tab file has to have been read first.
    def readVCF(self, toRead):
        self.vcfFileName = toRead
        with open(toRead, 'r') as vcf:
            #The first lines should be a number of meta-information lines, prepended by ##.
            #Should begin with fileformat. Store these. Check first line for correct format.
            line = vcf.readline()
            if (not line.startswith("##fileformat=")):
                print("VCF file is not in correct format")
                return -1
            else:
                print("VCF file seems ok, continuing read")
            while (line.startswith("##")):
                self.vcfInfoLines.append(line)
                line = vcf.readline()
            #A header line prepended by # should follow containing at least 8 fields, tab-delimited.
            #These are in order CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO. Store in info line list.
            fields = line.split('\t')
            if (not line.startswith('#') or len(fields) < 8):
                print("Header columns missing in VCF file")
                return -1
            else:
                self.vcfInfoLines.append(line)
            #All following lines are tab-delimited data lines.
            #Store variant data in chromosome item corresponding to CHROM field
            numvars = 0
            for line in vcf:
                numvars += 1
                fields = line.split('\t')
                chromRefName = fields[0]
                #Iterate through chromosome list to find match to insert data into
                for chromo in self.chromosomes:
                    if chromo.name == chromRefName:
                        #The ALT field is stripped of its '< >'
                        if fields[4].startswith('<'):
                            fields[4] = fields[4].strip('<')
                            fields[4] = fields[4].strip('>')
                        #The INFO field is processed here, looking for the END, GENE and CYTOBAND data points
                        list = fields[7].split(';')
                        #Default to '.' so that variants lacking a data point do not reuse values from an earlier variant
                        END = '.'
                        GENES = '.'
                        CBAND = '.'
                        for index in range(len(list)):
                            if list[index].startswith('END'):
                                END = list[index].split('=')[1]
                            if list[index].startswith('CSQ'):
                                #The CSQ field has several sub-fields, each separated with ','
                                sub_list = list[index].split(',')


                                geneList = []
                                for sub_index in range(len(sub_list)):
                                    #The gene name field is always the fourth element in the CSQ field separated with '|'
                                    sub_sub_list = sub_list[sub_index].split('|')
                                    geneList.append(sub_sub_list[3])
                                #Convert the list to a set to remove any duplicates
                                geneSet = set(geneList)
                                s = ', '
                                GENES = s.join(geneSet)
                            if list[index].startswith('CYTOBAND'):
                                sub_list = list[index].split('=')
                                #CBAND = sub_list[1].split(',')[0]
                                CBAND = sub_list[1]
                        chromo.addVariant(fields[1],fields[4], fields[7],END,GENES,CBAND)
                        break
            #DEBUG: print where variants are found, how many
            #print("Found %d variants:" % (numvars))
            #for chromo in self.chromosomes:
            #    print("%d in chromosome " % (len(chromo.variants)) + chromo.name)
        return 1

    def returnVCFHeader(self):
        #The last element of information lines should be the header, if we have read a VCF file.
        #Returns -1 if vcfInfoLines is not populated, i.e. if a vcf file has not been read.
        if self.vcfInfoLines:
            return self.vcfInfoLines[-1]
        else:
            return -1

    def returnVCFInfo(self):
        return self.vcfInfoLines

    def returnTabName(self):
        return self.tabFileName

    def returnVcfName(self):
        return self.vcfFileName

class Chromosome():

    def __init__(self, name, start):
        self.name = name
        self.start = start
        self.coverage = []
        self.coverageLog = []
        self.display = True
        self.variants = []
        self.connections = []
        self.connection_list = []


        self.display_connections = False
        self.display_cytoBandNames = False

    def addCoverage(self, coverageValue):
        self.coverage.append(coverageValue)
        if(coverageValue > 0):
            self.coverageLog.append(math.log(coverageValue,2))
        else:
            self.coverageLog.append(0)

    def setEnd(self,end):
        self.end = end

    def addVariant(self,pos,alt,info,end,gene,cband):
        variant = [pos,alt,info,end,gene,cband]
        self.variants.append(variant)

    def createConnections(self):
        #Checks if the second field in the variant array, i.e. the "ALT" field, starts with "N"; this implies the variant has an interaction
        #The corresponding values for the variant are then added to the list: CHRA,CHRB,WINA,WINB,CBANDS
        #Note that the values are picked by their fixed position in the INFO field
        for variant in self.variants:
            if variant[1].startswith('N'):
                list = variant[2].split(';')
                connection = [list[1].split('=')[1], list[3].split('=')[1], list[2].split('=')[1], list[4].split('=')[1], list[16].split('=')[1]]
                self.connections.append(connection)
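
createConnections picks the CHRA, CHRB, WINA, WINB and cytoband values by their fixed position in the semicolon-separated INFO string, which presumably only works while the variant caller writes its keys in exactly that order. A minimal sketch of a key-based alternative is given below. The helper names parse_info and connection_from_info are hypothetical and not part of the program, the key names (in particular CYTOBAND, taken from the check in readVCF) are assumptions, and the example INFO string is made up for illustration.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag entries without '=' map to True."""
    result = {}
    for entry in info.strip().split(';'):
        if not entry:
            continue
        key, sep, value = entry.partition('=')
        result[key] = value if sep else True
    return result


def connection_from_info(info):
    """Return [CHRA, CHRB, WINA, WINB, CYTOBAND], or None if any key is missing.

    Hypothetical helper; the key names are assumptions about the INFO field.
    """
    fields = parse_info(info)
    keys = ['CHRA', 'CHRB', 'WINA', 'WINB', 'CYTOBAND']
    if not all(key in fields for key in keys):
        return None
    return [fields[key] for key in keys]


# Made-up INFO string; the order of the keys no longer matters
example = 'SVTYPE=BND;CHRA=1;WINA=100,200;CHRB=X;WINB=300,400;CYTOBAND=p36.33,q21.1'
print(connection_from_info(example))
```

The returned list has the same order as the connection list built in createConnections, so drawConnections could in principle consume it unchanged.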


Karyogram.py

from PySide.QtCore import *
from PySide.QtGui import *

class KaryogramView(QGraphicsView):

    def __init__(self,chromosomes,cytoInfo):
        self.scene = QGraphicsScene()
        super().__init__(self.scene)
        self.chromosomes = chromosomes
        self.cytoInfo = cytoInfo
        self.numDispChromos = 24
        self.cytoGraphicItems = {}
        self.connectionGraphicItems = []
        self.setRenderHints(QPainter.Antialiasing)
        self.resize(800,600)
        self.show()
        #create a list of stain names, to be able to set their colors later..
        self.stainNames = ['acen','gneg','gpos100','gpos25','gpos50','gpos75','gvar','stalk']
        self.colors = {'acen':Qt.darkRed, 'gneg':Qt.white,'gpos100':Qt.black,'gpos25':Qt.lightGray,'gpos50':Qt.gray,
                       'gpos75':Qt.darkGray,'gvar':Qt.white,'stalk':Qt.red}
        self.createSettings()
        self.updateItems()

    def createSettings(self):
        self.settingsModel = QStandardItemModel()
        #create header labels to distinguish different settings.
        verticalHeaders = []
        self.settingsModel.setVerticalHeaderLabels(verticalHeaders)
        self.settingsModel.itemChanged.connect(self.updateSettings)
        self.colorModel = QStandardItemModel()
        stainItems = []
        colorItems = []
        for stainName in self.stainNames:
            stainItem = QStandardItem(stainName)
            stainItem.setEditable(False)
            stainItem.setSelectable(False)
            stainItems.append(stainItem)
            colorItem = QStandardItem()
            colorItem.setBackground(self.colors[stainName])
            colorItem.setEditable(False)
            colorItem.setSelectable(False)


            colorItems.append(colorItem)
        self.colorModel.appendColumn(stainItems)
        self.colorModel.appendColumn(colorItems)

    def updateSettings(self,item):
        pass

    def viewSettings(self):
        self.settingsList = QTableView()
        self.settingsList.setEditTriggers(QAbstractItemView.AllEditTriggers)
        self.settingsList.setShowGrid(False)
        self.settingsList.horizontalHeader().hide()
        self.settingsList.verticalHeader().hide()
        self.settingsList.setModel(self.settingsModel)
        self.settingsList.setTextElideMode(Qt.ElideNone)
        self.colorList = QTableView()
        self.colorList.setShowGrid(False)
        self.colorList.horizontalHeader().hide()
        self.colorList.verticalHeader().hide()
        self.colorList.setModel(self.colorModel)
        self.colorList.doubleClicked.connect(self.pickColor)
        self.settingsDia = QDialog(self)
        self.settingsDia.setWindowTitle("Settings")
        applyButton = QPushButton('Apply', self.settingsDia)
        applyButton.clicked.connect(self.settingsDia.accept)
        self.settingsDia.layout = QGridLayout(self.settingsDia)
        self.settingsDia.layout.addWidget(self.settingsList,0,0,1,2)
        self.settingsDia.layout.addWidget(self.colorList,0,2,1,2)
        self.settingsDia.layout.addWidget(applyButton,1,0,1,1)
        self.settingsDia.show()

    #Creates data model for info window
    def createChInfo(self):
        self.chModel = QStandardItemModel()
        topstring = ["Name","Length","No. of variants","Display","Draw connections", "Cyto band names"]
        self.chModel.setHorizontalHeaderLabels(topstring)
        for chromo in self.chromosomes:
            infostring = [chromo.name,chromo.end,str(len(chromo.variants))]
            infoItems = [QStandardItem(string) for string in infostring]
            dispCheckItem = QStandardItem()
            dispCheckItem.setCheckable(False)
            connCheckItem = QStandardItem()
            connCheckItem.setCheckable(False)
            connCheckItem.setCheckState(Qt.Unchecked)
            cytoCheckItem = QStandardItem()
            cytoCheckItem.setCheckable(False)
            cytoCheckItem.setCheckState(Qt.Unchecked)
            checkList = [dispCheckItem, connCheckItem, cytoCheckItem]
            infoItems.extend(checkList)


            if (self.chromosomes.index(chromo) < 24):
                dispCheckItem.setCheckState(Qt.Checked)
                self.chModel.appendRow(infoItems)
            else:
                chromo.display = False

    #Creates a window with chromosomes and toggles, info
    def showChInfo(self):
        #if any earlier window is open, close it
        try:
            self.chDia.close()
        except:
            pass
        self.chList = QTableView()
        self.chList.verticalHeader().hide()
        self.chList.setSelectionMode(QAbstractItemView.ExtendedSelection)
        self.chList.setSelectionBehavior(QAbstractItemView.SelectRows)
        self.chList.setEditTriggers(QAbstractItemView.NoEditTriggers)
        self.chList.setShowGrid(False)
        self.chList.setModel(self.chModel)
        self.chList.resizeColumnsToContents()
        #Give the length column some extra space..
        curWidth = self.chList.columnWidth(1)
        self.chList.setColumnWidth(1,curWidth+20)
        self.chDia = QDialog(self)
        self.chDia.setWindowTitle("Chromosome info")
        #Button for toggling display of selected chromosomes in the scene
        togButton = QPushButton('Toggle display', self.chDia)
        togButton.clicked.connect(self.toggleDisp)
        #Button for viewing selected chromosome variants
        viewVarButton = QPushButton('View variants', self.chDia)
        viewVarButton.clicked.connect(self.viewVariants)
        #Button for toggling connections
        connButton = QPushButton('Toggle connections', self.chDia)
        connButton.clicked.connect(self.toggleConnections)
        #Button for toggling cyto band names
        cytoButton = QPushButton('Toggle cyto band names', self.chDia)
        cytoButton.clicked.connect(self.toggleBandNames)
        self.chDia.layout = QGridLayout(self.chDia)
        self.chDia.layout.addWidget(self.chList,0,0,1,4)
        self.chDia.layout.addWidget(togButton,1,0,1,1)
        self.chDia.layout.addWidget(viewVarButton,1,1,1,1)
        self.chDia.layout.addWidget(connButton,1,2,1,1)
        self.chDia.layout.addWidget(cytoButton,1,3,1,1)
        self.chDia.setMinimumSize(700,400)
        self.chDia.show()

    #Creates data model for variants in given chromosome
    def createVariantInfo(self, chromo):
        self.varModel = QStandardItemModel()
        topstring = ['START', 'ALT', 'END', 'GENE(S)', 'CYTOBAND']
        self.varModel.setHorizontalHeaderLabels(topstring)


        #Adding variant info to a list (except the info field, which has index=2 in the variant list)
        for variant in chromo.variants:
            infoitem = []
            infoitem.append(QStandardItem(variant[0]))
            infoitem.append(QStandardItem(variant[1]))
            infoitem.append(QStandardItem(variant[3]))
            infoitem.append(QStandardItem(variant[4]))
            infoitem.append(QStandardItem(variant[5]))
            self.varModel.appendRow(infoitem)

    #Creates a popup containing variant info in a table.
    #Could be implemented in a better way than multiple dialogues..
    def viewVariants(self):
        selectedIndexes = self.chList.selectedIndexes()
        selectedRows = [index.row() for index in selectedIndexes]
        selectedRows = set(selectedRows)
        for row in selectedRows:
            chromo = self.chromosomes[row]
            self.createVariantInfo(chromo)
            viewVarDia = QDialog(self)
            viewVarDia.setWindowTitle("Variants in contig " + chromo.name)
            varList = QTableView()
            varList.setMinimumSize(440,400)
            varList.verticalHeader().hide()
            varList.setEditTriggers(QAbstractItemView.NoEditTriggers)
            varList.setModel(self.varModel)
            varList.resizeColumnToContents(1)
            viewVarDia.layout = QGridLayout(viewVarDia)
            viewVarDia.layout.addWidget(varList,0,0)
            viewVarDia.show()

    def toggleDisp(self):
        #The row associated with the item corresponds to a chromosome
        #row 1 is chr 1, row 2 is chr2 ... 23 is x, 24 is y and so on
        #which corresponds to index 0, 1 ... 22, 23 in list of chromosomes
        selectedIndexes = self.chList.selectedIndexes()
        selectedRows = [index.row() for index in selectedIndexes]
        #Convert to a set to get unique rows, since every column in the table is selected
        selectedRows = set(selectedRows)
        for row in selectedRows:
            dispConnItem = self.chModel.item(row,4)
            dispItem = self.chModel.item(row,3)
            if (dispItem.checkState() == Qt.Checked):
                dispItem.setCheckState(Qt.Unchecked)
                self.chromosomes[row].display = False
                dispConnItem.setCheckState(Qt.Unchecked)
                self.chromosomes[row].display_connections = False
                self.numDispChromos -= 1
            else:
                dispItem.setCheckState(Qt.Checked)
                self.chromosomes[row].display = True
                self.numDispChromos += 1


        self.updateItems()

    def toggleConnections(self):
        selectedIndexes = self.chList.selectedIndexes()
        selectedRows = [index.row() for index in selectedIndexes]
        selectedRows = set(selectedRows)
        for row in selectedRows:
            dispConnItem = self.chModel.item(row,4)
            if self.chromosomes[row].display_connections:
                dispConnItem.setCheckState(Qt.Unchecked)
                self.chromosomes[row].display_connections = False
            else:
                dispConnItem.setCheckState(Qt.Checked)
                self.chromosomes[row].display_connections = True
        self.updateItems()

    def toggleBandNames(self):
        selectedIndexes = self.chList.selectedIndexes()
        selectedRows = [index.row() for index in selectedIndexes]
        selectedRows = set(selectedRows)
        for row in selectedRows:
            dispCytoName = self.chModel.item(row,5)
            if self.chromosomes[row].display_cytoBandNames:
                dispCytoName.setCheckState(Qt.Unchecked)
                self.chromosomes[row].display_cytoBandNames = False
            else:
                dispCytoName.setCheckState(Qt.Checked)
                self.chromosomes[row].display_cytoBandNames = True
        self.updateItems()

    def drawConnections(self):
        self.connectionGraphicItems = []
        #Loops through the full list of chromosomes and checks if the connections should be displayed or not
        for chrA in self.chromosomes:
            if not (chrA.display_connections and chrA.display):
                continue
            #only create the connection list if it has not been initialized earlier
            if not chrA.connections:
                chrA.createConnections()
            for connection in chrA.connections:
                #The information is stored as string elements and needs to be converted to integers
                if connection[1] == 'X':
                    chrBIndex=22
                    chrB = self.chromosomes[chrBIndex]
                elif connection[1] == 'Y':
                    chrBIndex=23
                    chrB = self.chromosomes[chrBIndex]
                elif (connection[1].startswith('G') or connection[1].startswith('M')):
                    continue
                else:
                    chrB = self.chromosomes[int(connection[1])-1]


                if not chrB.display:
                    continue
                #The cytobands which the connections will go between are gathered
                cbandA = connection[4].split(',')[0]
                cbandB = connection[4].split(',')[1]
                #The x-positions are accessed, the chromosome A x-pos has the chromosome width added to it. This will make the connection start on its right side
                xPosA = self.cytoGraphicItems[chrA.name].boundingRect().x() + self.cytoGraphicItems[chrA.name].boundingRect().width()
                xPosB = self.cytoGraphicItems[chrB.name].boundingRect().x()
                #Find the y position of the actual cytoband in each chromosome, by accessing the chromosome band dicts
                cBandAItem = self.cytoGraphicItems[chrA.name].bandItemsDict[cbandA]
                cBandBItem = self.cytoGraphicItems[chrB.name].bandItemsDict[cbandB]
                yPosA = cBandAItem.rect().top() + cBandAItem.rect().height() / 2
                yPosB = cBandBItem.rect().top() + cBandBItem.rect().height() / 2
                #If the item has been moved, x and y are how much the item has been moved by; update position with these
                xPosA += self.cytoGraphicItems[chrA.name].x()
                xPosB += self.cytoGraphicItems[chrB.name].x()
                yPosA += self.cytoGraphicItems[chrA.name].y()
                yPosB += self.cytoGraphicItems[chrB.name].y()
                pointA = QPoint(xPosA, yPosA)
                pointB = QPoint(xPosB, yPosB)
                connectionPath = QPainterPath()
                connectionPath.moveTo(pointA)
                connectionPath.lineTo(pointB)
                connectionItem = QGraphicsPathItem(connectionPath)
                #Set the color of the line to chrB's stain color (makes it difficult to distinguish though..)
                pen = QPen()
                #pen.setBrush(cBandBItem.brush())
                pen.setBrush(Qt.darkYellow)
                connectionItem.setPen(pen)
                self.scene.addItem(connectionItem)
                self.connectionGraphicItems.append(connectionItem)

    def pickColor(self,modelIndex):
        if modelIndex.column() == 1:
            selectedRow = modelIndex.row()
            stainItem = self.colorModel.item(selectedRow,0)
            colorItem = self.colorModel.item(selectedRow,1)
            chosenColor = QColorDialog.getColor(colorItem.background().color())
            self.colors[stainItem.text()] = chosenColor
            colorItem.setBackground(chosenColor)

    #Create chromosome items consisting of cytobands, names of bands, and chromosome names
    def createChromosomeItems(self):
        if self.numDispChromos > 0:
            size = self.size()
            containerRect = QRect(QPoint(50,50), QPoint(size.width()-50,size.height()-50))
            #find the maximum displayed chromosome length, and let this be 100% of item length


            maxBp = 0
            for chromo in self.chromosomes:
                #Skip the same chromosomes that are skipped when drawing below
                if not chromo.display or "GL" in chromo.name or "MT" in chromo.name:
                    continue
                if int(chromo.end) > maxBp:
                    maxBp = int(chromo.end)
            #Lays out items horizontally with equal spacing between each other, with a width depending on screen size
            currentXPosition = containerRect.left()
            xIncrement = containerRect.width() / self.numDispChromos
            self.chromoWidth = containerRect.width() / 48
            #Create the graphic items for each chromosome if they are set to be displayed
            for chromo in self.chromosomes:
                if not chromo.display or "GL" in chromo.name or "MT" in chromo.name:
                    continue
                chromoHeight = (int(chromo.end)/maxBp)*(containerRect.height())
                bandItems = []
                textItems = []
                placeLeft = True
                #Find each cytoband for this chromosome, and create band items using this data
                for cyto in self.cytoInfo:
                    if cyto[0] == chromo.name:
                        cytoStart = int(cyto[1])
                        cytoEnd = int(cyto[2])
                        totalCytoBP = cytoEnd-cytoStart
                        bandHeight = (totalCytoBP / int(chromo.end)) * (chromoHeight)
                        bandYPos = (cytoStart / int(chromo.end)) * (chromoHeight)
                        bandXPos = currentXPosition
                        bandWidth = self.chromoWidth
                        #Create a rect item with corresponding stain color, tooltip, set data to band name for later use
                        bandRectItem = QGraphicsRectItem(bandXPos,bandYPos,bandWidth,bandHeight)
                        bandRectItem.setBrush(self.colors[cyto[4]])
                        bandRectItem.setToolTip(cyto[3] + ": " + str(totalCytoBP) + " bp")
                        bandRectItem.setData(0,cyto[3])
                        self.scene.addItem(bandRectItem)
                        bandItems.append(bandRectItem)
                        if chromo.display_cytoBandNames:
                            bandNameItem = QGraphicsTextItem(cyto[3])
                            nameXPosition = bandRectItem.rect().left()-bandRectItem.boundingRect().width() if placeLeft else bandRectItem.rect().right()
                            bandNameItem.setPos(nameXPosition,bandRectItem.rect().top())
                            self.scene.addItem(bandNameItem)
                            textItems.append(bandNameItem)
                            placeLeft = not placeLeft
                chromoNameItem = QGraphicsTextItem(chromo.name)
                chromoNameItem.setPos(currentXPosition,chromoHeight)
                chromoNameItem.setScale(0.8)
                self.scene.addItem(chromoNameItem)
                textItems.append(chromoNameItem)


                #Create a custom graphic item group from created items, enter in dict
                cytoItem = KaryoGraphicItem(bandItems,textItems,chromo.name)
                self.cytoGraphicItems[chromo.name] = cytoItem
                currentXPosition += xIncrement
                self.scene.addItem(cytoItem)

    def updateItems(self):
        self.scene.clear()
        self.createChromosomeItems()
        self.drawConnections()
        self.update()

    def updateConnections(self):
        for item in self.connectionGraphicItems:
            self.scene.removeItem(item)
        self.drawConnections()
        self.update()

    #Opens a context menu on right click
    def contextMenuEvent(self,event):
        self.lastContextPos = event.pos()
        menu = QMenu()
        addSceneTextAct = QAction('Insert text',self)
        addSceneTextAct.triggered.connect(self.addSceneText)
        addLabelAct = QAction('Add label',self)
        addLabelAct.triggered.connect(self.addLabel)
        menu.addAction(addSceneTextAct)
        menu.addAction(addLabelAct)
        menu.exec_(QCursor.pos())

    def addSceneText(self):
        (text, ok) = QInputDialog.getText(None, 'Insert text', 'Text:')
        if ok and text:
            textItem = QGraphicsTextItem(text)
            textItem.setPos(self.lastContextPos)
            textItem.setFlag(QGraphicsItem.ItemIsMovable)
            textItem.setTextInteractionFlags(Qt.TextEditorInteraction)
            self.scene.addItem(textItem)

    def addLabel(self):
        #Adds a label item
        labelDialog = QDialog()
        labelDialog.setWindowTitle("Add label")
        applyButton = QPushButton('Ok', labelDialog)
        applyButton.clicked.connect(labelDialog.accept)
        textBox = QLineEdit()
        colorBox = QComboBox()
        colorStrings = QColor.colorNames()
        colorBox.addItems(colorStrings)
        textLabel = QLabel("Label text: ")
        colorLabel = QLabel("Label color: ")
        labelDialog.layout = QGridLayout(labelDialog)


        labelDialog.layout.addWidget(textLabel,0,0)
        labelDialog.layout.addWidget(textBox,0,1)
        labelDialog.layout.addWidget(colorLabel,1,0)
        labelDialog.layout.addWidget(colorBox,1,1)
        labelDialog.layout.addWidget(applyButton,2,0)
        choice = labelDialog.exec_()
        if choice == QDialog.Accepted:
            textItem = QGraphicsTextItem(textBox.text())
            rectItem = QGraphicsRectItem(textItem.boundingRect())
            rectItem.setBrush(QColor(colorBox.currentText()))
            self.scene.addItem(rectItem)
            self.scene.addItem(textItem)
            labelItem = self.scene.createItemGroup([rectItem,textItem])
            labelItem.setFlag(QGraphicsItem.ItemIsMovable)
            labelItem.setPos(self.lastContextPos)

    def mouseMoveEvent(self,event):
        QGraphicsView.mouseMoveEvent(self,event)
        if event.buttons() == Qt.LeftButton and self.scene.mouseGrabberItem():
            movedItem = self.scene.mouseGrabberItem()
            if movedItem.data(1) == 'karyoItem':
                self.updateConnections()

#Custom graphics group class for more convenient handling of cytoband items
class KaryoGraphicItem(QGraphicsItemGroup):

    def __init__(self,bandItems,textItems,nameString):
        super().__init__()
        self.setData(1,'karyoItem')
        #Go through the band items and add them to a dict, with key as band name
        self.bandItemsDict = {}
        for bandItem in bandItems:
            self.bandItemsDict[bandItem.data(0)] = bandItem
            bandItem.setData(1,'karyoItem')
            self.addToGroup(bandItem)
        for textItem in textItems:
            textItem.setData(1,'karyoItem')
            self.addToGroup(textItem)
        self.nameString = nameString
        self.setFlag(QGraphicsItem.ItemIsMovable)
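
The scaling used in createChromosomeItems can be summarised without any Qt code: the longest displayed chromosome fills the full container height, every other chromosome is scaled in proportion to it, and each cytoband takes up the same fraction of the chromosome item as it does of the chromosome in base pairs. A minimal sketch of this calculation is given below; the function name band_geometry and the numbers in the example are made up for illustration and are not part of the program.

```python
def band_geometry(cyto_start, cyto_end, chromo_end, max_bp, container_height):
    """Return (y position, height) of a cytoband, in the same units as container_height.

    Hypothetical helper mirroring the arithmetic in createChromosomeItems.
    """
    # The chromosome item height is proportional to the longest displayed chromosome
    chromo_height = (chromo_end / max_bp) * container_height
    # The band occupies the same fraction of the item as of the chromosome
    band_height = ((cyto_end - cyto_start) / chromo_end) * chromo_height
    band_y = (cyto_start / chromo_end) * chromo_height
    return band_y, band_height


# A chromosome half as long as the longest one, drawn in a 500 px high container
y, height = band_geometry(cyto_start=25, cyto_end=50, chromo_end=100,
                          max_bp=200, container_height=500)
print(y, height)
```

Since both results scale with container_height, resizing the view only changes that single argument.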


Coverage.py

from PySide.QtCore import *
from PySide.QtGui import *
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import numpy as np
from matplotlib.figure import Figure
from matplotlib.backend_bases import key_press_handler
from matplotlib.colors import ListedColormap, BoundaryNorm
from matplotlib.collections import LineCollection
from matplotlib.backends.backend_qt4agg import (
    FigureCanvasQTAgg as FigureCanvas,
    NavigationToolbar2QT as NavigationToolbar)

class CoverageView(QWidget):

    def __init__(self,chromosomes):
        super().__init__()
        self.chromosomes = chromosomes
        self.subWindows = []
        self.grid = QGridLayout()
        self.setLayout(self.grid)
        self.maxColumns = 2
        self.bpWindow = 50
        self.minCoverage = 0
        self.maxCoverage = 5
        self.createSettings()

    #Adds a subwindow containing a matplotlib widget to the grid layout
    def addChromoPlot(self):
        addDialog = QDialog()
        addDialog.setWindowTitle("Add plot")
        applyButton = QPushButton('Ok', addDialog)
        applyButton.clicked.connect(addDialog.accept)
        chromoBox = QComboBox()
        chromoStrings = [chromo.name for chromo in self.chromosomes if not "GL" in chromo.name]
        chromoBox.addItems(chromoStrings)
        chrLabel = QLabel("Add plot for chromosome: ")
        typeBox = QComboBox()
        typeStrings = ["Line plot", "Scatter plot"]
        typeBox.addItems(typeStrings)


        typeLabel = QLabel("Plot method: ")
        addDialog.layout = QGridLayout(addDialog)
        addDialog.layout.addWidget(chrLabel,0,0)
        addDialog.layout.addWidget(chromoBox,0,1)
        addDialog.layout.addWidget(typeLabel,1,0)
        addDialog.layout.addWidget(typeBox,1,1)
        addDialog.layout.addWidget(applyButton,2,0)
        choice = addDialog.exec_()
        if choice == QDialog.Accepted:
            chromo = self.chromosomes[chromoBox.currentIndex()]
            chromoPlot = ChromoPlotWindow(chromo,typeBox.currentIndex(),self)
            self.subWindows.append(chromoPlot)
            self.arrangePlots()

    #Removes a plot and rearranges existing plots
    def removeChromoPlot(self,plot):
        self.subWindows.remove(plot)
        self.grid.removeWidget(plot)
        plot.destroy()
        self.arrangePlots()

    def arrangePlots(self):
        currentColumn = 0
        currentRow = 0
        for plot in self.subWindows:
            self.grid.addWidget(plot,currentRow,currentColumn)
            if currentColumn == self.maxColumns-1:
                currentRow += 1
                currentColumn = 0
            else:
                currentColumn += 1
        self.update()

    def createSettings(self):
        self.settingsModel = QStandardItemModel()
        #create header labels to distinguish different settings.
        verticalHeaders = ["bpWindow", "minCoverage", "maxCoverage"]
        self.settingsModel.setVerticalHeaderLabels(verticalHeaders)
        bpWinText = QStandardItem("BP Resolution (kb)")
        bpWinText.setEditable(False)
        bpWinText.setToolTip("Show each data point as the average of this value (x1000 bp)")
        bpWinData = QStandardItem()
        bpWinData.setData(self.bpWindow,0)
        bpWinData.setEditable(True)
        minCovLimitText = QStandardItem("Min.coverage value (%)")
        minCovLimitText.setEditable(False)
        minCovLimitText.setToolTip("Minimum coverage value,\nin percentage of average coverage value of genome.")
        minCovLimitData = QStandardItem()
        minCovLimitData.setData(self.minCoverage*100,0)
        minCovLimitData.setEditable(True)
        maxCovLimitText = QStandardItem("Max. coverage value (%)")

        maxCovLimitText.setEditable(False)
        maxCovLimitText.setToolTip("Maximum coverage value,\nin percentage of average coverage value of genome.")
        maxCovLimitData = QStandardItem()
        maxCovLimitData.setData(self.maxCoverage*100,0)
        maxCovLimitData.setEditable(True)
        maxColumnsText = QStandardItem("Number of columns")
        maxColumnsText.setEditable(False)
        maxColumnsText.setToolTip("Number of columns to arrange diagrams in")
        maxColumnsData = QStandardItem()
        maxColumnsData.setData(self.maxColumns,0)
        maxColumnsData.setEditable(True)
        self.settingsModel.setItem(0,0,bpWinText)
        self.settingsModel.setItem(0,1,bpWinData)
        self.settingsModel.setItem(1,0,minCovLimitText)
        self.settingsModel.setItem(1,1,minCovLimitData)
        self.settingsModel.setItem(2,0,maxCovLimitText)
        self.settingsModel.setItem(2,1,maxCovLimitData)
        self.settingsModel.setItem(3,0,maxColumnsText)
        self.settingsModel.setItem(3,1,maxColumnsData)
        self.settingsModel.itemChanged.connect(self.updateSettings)

    def viewSettings(self):
        self.settingsList = QTableView()
        self.settingsList.setEditTriggers(QAbstractItemView.AllEditTriggers)
        self.settingsList.setShowGrid(False)
        self.settingsList.horizontalHeader().hide()
        self.settingsList.verticalHeader().hide()
        self.settingsList.setModel(self.settingsModel)
        self.settingsList.setTextElideMode(Qt.ElideNone)
        self.settingsDia = QDialog(self)
        self.settingsDia.setWindowTitle("Settings")
        applyButton = QPushButton('Apply', self.settingsDia)
        applyButton.clicked.connect(self.settingsDia.accept)
        self.settingsDia.layout = QGridLayout(self.settingsDia)
        self.settingsDia.layout.addWidget(self.settingsList,0,0,1,3)
        self.settingsDia.layout.addWidget(applyButton,1,0,1,1)
        self.settingsDia.show()

    def updateSettings(self,item):
        if item.row() == 0:
            self.bpWindow = item.data(0)
        if item.row() == 1:
            self.minCoverage = item.data(0)/100
        if item.row() == 2:
            self.maxCoverage = item.data(0)/100
        if item.row() == 3:
            self.maxColumns = item.data(0)

#Widget containing a pyplot, plotting coverage data from given chromosome
class ChromoPlotWindow(QWidget):

    def __init__(self,chromo,plotType,parent):
        super().__init__(parent)
        self.chromo = chromo
        self.setMinimumSize(500,500)
        self.figure = Figure(figsize=(5,2),dpi=100)
        self.canvas = FigureCanvas(self.figure)
        self.canvas.setParent(self)
        self.canvas.setFocusPolicy( Qt.ClickFocus )
        self.canvas.setFocus()
        #Set true/false on toolbar to toggle coordinate display
        self.mpl_toolbar = NavigationToolbar(self.canvas, self, False)
        minLabel = QLabel("X min: ")
        maxLabel = QLabel("X max: ")
        self.minXSet = QLineEdit(self)
        self.maxXSet = QLineEdit(self)
        self.mpl_toolbar.addWidget(minLabel)
        self.mpl_toolbar.addWidget(self.minXSet)
        self.mpl_toolbar.addWidget(maxLabel)
        self.mpl_toolbar.addWidget(self.maxXSet)
        self.canvas.mpl_connect('key_press_event', self.on_key_press)
        self.canvas.mpl_connect('draw_event', self.updateSetLimits)
        self.canvas.mpl_connect('button_release_event', self.onClick)
        vbox = QVBoxLayout()
        vbox.addWidget(self.canvas)
        vbox.addWidget(self.mpl_toolbar)
        self.setLayout(vbox)
        self.ax = self.figure.add_subplot(111)
        normValue = self.parentWidget().coverageNorm
        minCov = normValue*self.parentWidget().minCoverage
        maxCov = normValue*self.parentWidget().maxCoverage
        #Split the coverage list into chunks of bpWindow entries each
        coverageChunks = [chromo.coverage[i:i+self.parentWidget().bpWindow]
                          for i in range(0,len(chromo.coverage),self.parentWidget().bpWindow)]
        self.coverageData = []
        #Average each chunk, clamp it to [minCov,maxCov] and normalise it
        for chunk in coverageChunks:
            val = sum(chunk) / len(chunk)
            if val > maxCov:
                val = maxCov
            if val < minCov:
                val = minCov
            self.coverageData.append(val/normValue)
        #Maps colors to coverage values as follows: green: [0,0.75], black: [0.75,1.25], red: [1.25,5]
        colorMap = ListedColormap(['g', 'black', 'r'])
        colorNorm = BoundaryNorm([0, 0.75, 1.25, 5], 3)
        #See the following example code for explanation
        #http://matplotlib.org/examples/pylab_examples/multicolored_line.html
        points = np.array([range(len(self.coverageData)), self.coverageData]).T.reshape(-1, 1, 2)
        segments = np.concatenate([points[:-1], points[1:]], axis=1)
        #converting the coverageData list into a numpy array needed for the LineCollection
        numpyArrayCData = np.array(self.coverageData)

        lc = LineCollection(segments, cmap=colorMap, norm=colorNorm)
        lc.set_array(numpyArrayCData)
        if plotType == 0:
            self.ax.add_collection(lc)
        elif plotType == 1:
            self.ax.scatter(range(len(self.coverageData)),self.coverageData,
                            c=self.coverageData, cmap= colorMap, norm=colorNorm)
        #Create an input validator for the manual x range input boxes, range is no of bins
        self.xRangeValidator = QIntValidator(0,len(self.coverageData),self)
        self.minXSet.setValidator(self.xRangeValidator)
        self.maxXSet.setValidator(self.xRangeValidator)
        self.minXSet.setText("0")
        self.maxXSet.setText(str(len(self.coverageData)))
        self.minXSet.returnPressed.connect(self.updateXRange)
        self.maxXSet.returnPressed.connect(self.updateXRange)
        self.ax.set_xlim(0,len(self.coverageData))
        self.ax.set_ylim(minCov/normValue,maxCov/normValue)
        self.ax.set_title("Contig " + chromo.name)
        self.ax.set_xlabel("Position (x" + str(self.parentWidget().bpWindow) + " kb)")
        self.ax.set_ylabel("Coverage")
        self.canvas.updateGeometry()
        self.canvas.draw()

    def on_key_press(self, event):
        key_press_handler(event, self.canvas, self.mpl_toolbar)

    def updateXRange(self):
        self.ax.set_xlim(int(self.minXSet.text()),int(self.maxXSet.text()))
        self.canvas.draw()

    def updateSetLimits(self,event):
        xmin,xmax = self.ax.get_xlim()
        if xmin < 0:
            xmin = 0
        if xmax > len(self.coverageData):
            xmax = len(self.coverageData)
        self.ax.set_xlim(xmin,xmax)
        self.minXSet.setText(str(int(xmin)))
        self.maxXSet.setText(str(int(xmax)))

    #Opens a context menu on ctrl+right click on a plot
    def onClick(self, event):
        if event.button == 3 and event.key == 'control':
            menu = QMenu()
            self.clickX = event.xdata
            self.clickY = event.ydata
            addPlotTextAct = QAction('Insert text',self)
            addPlotTextAct.triggered.connect(self.addPlotText)
            deletePlotAct = QAction('Delete plot',self)
            deletePlotAct.triggered.connect(self.deletePlot)
            menu.addAction(addPlotTextAct)

            menu.addAction(deletePlotAct)
            canvasHeight = int(self.figure.get_figheight()*self.figure.dpi)
            menu.exec_(self.mapToGlobal(QPoint(event.x,canvasHeight-event.y)))

    #Adds a given text to the clicked location (in data coordinates) to the plot
    def addPlotText(self):
        (text, ok) = QInputDialog.getText(None, 'Insert text', 'Text:')
        if ok and text:
            self.ax.text(self.clickX, self.clickY, text)
            self.canvas.draw()

    def deletePlot(self):
        self.hide()
        self.parentWidget().removeChromoPlot(self)
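
The two pieces of non-GUI logic in the listing above, the coverage binning in ChromoPlotWindow.__init__ and the row-by-row placement in arrangePlots, can be hard to follow inside the widget code. The following is a minimal stand-alone sketch of both, written without Qt or matplotlib so it can be run directly. The function names bin_coverage and grid_position and their parameter names are hypothetical and do not appear in the original program; the sketch assumes that the coverage list is non-empty and that the window size and column count are positive.

```python
def bin_coverage(coverage, window, norm_value, min_frac, max_frac):
    """Average `coverage` in chunks of `window` entries, clamp, and normalise.

    Hypothetical helper mirroring the loop in ChromoPlotWindow.__init__:
    - min_frac / max_frac play the role of minCoverage / maxCoverage
    - norm_value plays the role of coverageNorm
    """
    if window <= 0:
        raise ValueError("window must be positive")
    min_cov = norm_value * min_frac
    max_cov = norm_value * max_frac
    binned = []
    for start in range(0, len(coverage), window):
        chunk = coverage[start:start + window]  # the last chunk may be shorter
        mean = sum(chunk) / len(chunk)
        # Clamp to [min_cov, max_cov], upper limit first as in the listing
        mean = max(min_cov, min(mean, max_cov))
        binned.append(mean / norm_value)
    return binned


def grid_position(index, max_columns):
    """Return (row, column) for the plot at `index`, filling row by row.

    Hypothetical helper; equivalent to the counters kept in arrangePlots.
    """
    if max_columns <= 0:
        raise ValueError("max_columns must be positive")
    return divmod(index, max_columns)


if __name__ == "__main__":
    # Made-up coverage values, purely for illustration
    example = [10, 20, 30, 40, 100, 0, 5]
    print(bin_coverage(example, 2, 20, 0.0, 2.0))
    print([grid_position(i, 2) for i in range(5)])
```

With these made-up inputs the third bin has a mean of 50, which exceeds the upper limit of 40 and is therefore clamped before normalisation. Keeping such logic in plain functions is one possible design choice; it would make the calculations testable without starting a QApplication.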

App.py

from PySide.QtGui import QApplication
import sys
import graphics

if __name__ == '__main__':
    app = QApplication(sys.argv)
    mainwin = graphics.WGSView()
    sys.exit(app.exec_())

TRITA STH 2016:27

www.kth.se