What are the challenges for Data Science? Magnus Rattray Director, University of Manchester Data Science Institute Professor of Computational & Systems Biology Faculty of Biology, Medicine & Health University of Manchester www.datascience.manchester.ac.uk


Page 1: What are the challenges for Data Science?...Hincks, S., Kingston, R., Webb, B. and Wong, C. (in press) A New Geodemographic Classification of Commuting Flows for England and Wales

What are the challenges for Data Science?

Magnus Rattray
Director, University of Manchester Data Science Institute

Professor of Computational & Systems Biology
Faculty of Biology, Medicine & Health

University of Manchester

www.datascience.manchester.ac.uk

Page 2

Astronomy

The Large Synoptic Survey Telescope:
• 3.2 Gpixel camera
• 2000 exposures per night
• 20 TB per night
• 10-year survey: 100 PB data

In its first month of operation, LSST will survey more of the Universe than all previous telescopes

Page 3

Particle physics

Large Hadron Collider (ATLAS experiment)
• 1 billion proton-proton collisions every second
• Nominal output rate of detector: 68 TB/s
• Actual output rate to disk: 1.5 GB/s (reduced via fast identification of “interesting” events)
• Data rate of up to 100 TB per day, for up to 6 months per year, for 10-15 years: 200 PB
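As a rough check on the Large Hadron Collider figures, the following Python sketch recomputes the reduction factor between detector output and disk, and the total archived volume. It assumes decimal units (1 PB = 1000 TB) and takes "up to 6 months" as 183 days; both are illustrative assumptions, not values from the slide.

```python
# Back-of-envelope check of the ATLAS data rates quoted on the slide.
# Assumptions (not from the slide): decimal units, 6 months = 183 days.

TB_PER_PB = 1000
GB_PER_TB = 1000

nominal_rate_gb_s = 68 * GB_PER_TB   # 68 TB/s off the detector
disk_rate_gb_s = 1.5                 # 1.5 GB/s written to disk

# Factor by which event selection reduces the data stream
reduction_factor = nominal_rate_gb_s / disk_rate_gb_s


def total_volume_pb(tb_per_day, days_per_year, years):
    """Total archived volume in PB for a constant daily rate."""
    return tb_per_day * days_per_year * years / TB_PER_PB


low = total_volume_pb(100, 183, 10)
high = total_volume_pb(100, 183, 15)

print(f"reduction factor: about {reduction_factor:,.0f}x")
print(f"upper-bound volume over 10 to 15 years: {low} to {high} PB")
```

Under these assumptions the peak daily rate gives a total of the same order as the 200 PB quoted, and the selection step discards all but roughly one part in 45,000 of the raw stream.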

Page 4

Geography

Commute-flow is a brand new geodemographic classification of commuting flows for England and Wales, based on origin-destination data from the 2011 Census, that has been used to analyse the spatial dynamics of commuting. An interactive toolkit is at www.commute-flow.net

26 million travel-to-work flows recorded in the 2011 Census for England and Wales

A new two-tier geodemographic typology of commuting patterns with 9 super-groups and a total of 40 groups. Each includes a pen portrait with an interactive flow map and radial chart.

Hincks, S., Kingston, R., Webb, B. and Wong, C. (in press) A New Geodemographic Classification of Commuting Flows for England and Wales. International Journal of Geographical Information Science.

Page 5

Mental health

1. Raw GPS data
2. Detection of geolocation visited
3. Geolocations visited
4. Identification of places visited
5. Places visited
6. Type of places and activities recognition
7. Out-of-home activities

[Figure labels: Sport; Swimming pool; Volleyball]

Difrancesco et al. Out-of-home activity recognition from GPS data in schizophrenic patients. IEEE 29th International Symposium on Computer-Based Medical Systems (CBMS 2016).
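The step from raw GPS data to geolocations visited can be illustrated with a generic stay-point detector. This is a minimal sketch of a common technique, not necessarily the method used by Difrancesco et al.; the record layout, the 100 m radius and the 10 minute dwell time are all hypothetical choices made for illustration.

```python
import math
from typing import List, Tuple

# (timestamp in seconds, latitude, longitude); a hypothetical record layout
Fix = Tuple[float, float, float]


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def stay_points(fixes: List[Fix], max_dist_m=100.0, min_stay_s=600.0):
    """Return (mean_lat, mean_lon, arrival, departure) for each stay.

    A stay is a run of consecutive fixes that all lie within max_dist_m
    of the first fix in the run and together span at least min_stay_s.
    """
    stays = []
    i, n = 0, len(fixes)
    while i < n:
        j = i + 1
        while j < n and haversine_m(fixes[i][1], fixes[i][2],
                                    fixes[j][1], fixes[j][2]) <= max_dist_m:
            j += 1
        # fixes[i:j] are all close to fixes[i]
        if fixes[j - 1][0] - fixes[i][0] >= min_stay_s:
            run = fixes[i:j]
            lat = sum(f[1] for f in run) / len(run)
            lon = sum(f[2] for f in run) / len(run)
            stays.append((lat, lon, run[0][0], run[-1][0]))
            i = j
        else:
            i += 1
    return stays


# Synthetic track: 15 minutes at one spot, a gap, then 5 minutes elsewhere
track = [(t * 60.0, 53.4668, -2.2339) for t in range(16)]
track += [(t * 60.0, 53.4800, -2.2400) for t in range(20, 26)]
found = stay_points(track)
print(found)
```

Only the first cluster is reported as a stay, because the second visit is shorter than the assumed minimum dwell time. Later steps in the pipeline (matching stays to named places and activity types) would need external place data and are not sketched here.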

Page 6

Respiratory health

Page 7

Research is increasingly data-driven

Bottom-up modelling:
• Define model of system from assumed microscopic principles
• Develop a tractable approximation to “solve” the model
• Explore system properties for various parameter settings (e.g. growth rates, stationary properties, phase transitions)
• Test/refine/revise the model given experimental data

Data-driven modelling:
• Identify system variables that can be measured: the data
• Fit a generative or predictive statistical model to the data
• Make inferences, learn hidden variables, score models

Increasingly we are connecting these approaches – allowing for strong “mechanistic” prior knowledge within data-driven models
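One way to picture the connection between the two approaches is to take a mechanistic model and estimate its parameter from data. The sketch below assumes logistic growth as the bottom-up model and uses synthetic measurements; the model, the noise level, the fixed carrying capacity and the grid search are all illustrative assumptions rather than anything taken from the talk.

```python
import math
import random


def logistic(t, r, k=100.0, x0=1.0):
    """Closed-form solution of dx/dt = r * x * (1 - x / k)."""
    return k / (1.0 + (k / x0 - 1.0) * math.exp(-r * t))


def sum_sq_error(r, times, observed):
    """Misfit between the mechanistic prediction and the data."""
    return sum((logistic(t, r) - y) ** 2 for t, y in zip(times, observed))


# Synthetic "measurements": true growth rate 0.8 plus Gaussian noise
random.seed(0)
true_r = 0.8
times = [0.5 * i for i in range(25)]
observed = [logistic(t, true_r) + random.gauss(0.0, 2.0) for t in times]

# Data-driven step: choose the growth rate that best explains the data.
# A coarse grid search keeps the sketch dependency-free.
candidates = [0.01 * i for i in range(1, 301)]
best_r = min(candidates, key=lambda r: sum_sq_error(r, times, observed))

print(f"estimated growth rate: {best_r:.2f} (true value {true_r})")
```

The mechanistic form constrains the fit to a single interpretable parameter, which is the sense in which prior knowledge sits inside a data-driven procedure here.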

Page 8

Challenges for Data Science

• Big data – scalability
• Complex data – modelling & inference
• Messy data – probability & statistics
• Human data – privacy, ethics, interaction
• Accessible data – openness, reproducibility

Page 9

Example: Genomics

High-throughput DNA sequencing

“Data handling is now the bottleneck. It costs more to analyze a genome than to sequence a genome.” David Haussler

Page 10

Genomics: big data

@SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50
TTGCCTGCCTATCATTTTAGTGCCTGTGAGGTGGAGATGTGAGGATCAGT
+SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50
hhhhhhhhhhghhghhhhhfhhhhhfffffe`ee[`X]b[d[ed`[Y[^…
@SRR566546.971 HWUSI-EAS1673_11067_FC7070M:4:1:2374:1108 length=50
GATTTGTATGAAAGTATACAACTAAAACTGCAGGTGGATCAGAGTAAGTC
+SRR566546.971 HWUSI-EAS1673_11067_FC7070M:4:1:2374:1108 length=50
hhhhgfhhcghghggfcffdhfehhhhcehdchhdhahehffffde`…
@SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50
TGCATGATCTTCAGTGCCAGGACCTTATCAAGCGGTTTGGTCCCTTTGTT
+SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50
dhhhgchhhghhhfhhhhhdhhhhehhghfhhhchfddffcffafhfghe

200 GB data for 60x coverage over human genome; 20 PB for 100K genomes
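The sequencing reads on this slide are in FASTQ format: four lines per read, giving an identifier, the bases, a separator and one quality character per base. The sketch below parses one of the reads, with its header line written out in full, and decodes the quality string. The Phred offset of 64 is an assumption based on the characters present (older Illumina pipelines used it); the function name `parse_fastq` is hypothetical.

```python
from typing import Iterator, List, Tuple

PHRED_OFFSET = 64  # assumed: older Illumina encoding, where 'h' means Q40


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, quality_scores) from 4-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (s.strip() for s in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        scores = [ord(c) - PHRED_OFFSET for c in qual]
        yield header[1:].split()[0], seq, scores


record = [
    "@SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50",
    "TGCATGATCTTCAGTGCCAGGACCTTATCAAGCGGTTTGGTCCCTTTGTT",
    "+SRR566546.972 HWUSI-EAS1673_11067_FC7070M:4:1:2438:1109 length=50",
    "dhhhgchhhghhhfhhhhhdhhhhehhghfhhhchfddffcffafhfghe",
]

reads = list(parse_fastq(record))
read_id, seq, scores = reads[0]
print(read_id, len(seq), min(scores), max(scores))

# Storage estimate from the slide: 200 GB per genome, 100,000 genomes
total_pb = 200 * 100_000 / 1_000_000  # GB -> PB, decimal units
print(f"{total_pb:.0f} PB")
```

Each base costs roughly two characters of storage (the base and its quality), plus a repeated header, which is part of why raw read files are so much larger than the genome they describe.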

Page 11

Genomics: complex data

• DNA sequencing is an incredibly disruptive technology
• Genomics is not just about genomes! Many ‘omics layers

RNA-Seq: Transcriptomics
Bis-Seq, ChIP-Seq: Epigenomics
DNA-Seq: Genomics
HiC, ChIA-PET: Interactomics

Roy et al. Science 2010

Page 12

Genomics: messy data

Lister, Pelizzola et al. Nature 2009

Page 13

• 111 reference “epigenomes”
• 2804 high-throughput sequencing datasets
• 1.5 x 10^11 mapped sequence reads
• >10^13 sequenced DNA bases (>1000 genomes)

Every new ‘omic layer is as big as a genome

Page 14

Genomics – human data

Genomic & Precision medicine

Precision diagnosis & precision treatment

Prognostics & Theranostics

Informing prevention

New models of care at disease boundaries

Driving rapid innovation & adoption

Role of multi-omics

Linking ‘big’ data

Re-aligning incentives for commissioning – driven by science, research

“Genomics – the changing face of clinical care” Sue Hill, Chief Scientific Officer for England

Page 15

Example: Asthmas Stretch Genomics

• Life-course complexity indicates multiple (sub-)diseases
  – Usually starts young
  – May progress, remit or relapse over life

• Inconsistent gene-environment interactions indicate multiple (sub-)diseases
  – Variable effects of genetic polymorphisms, e.g. CD14
  – Variable treatment-setting interactions

[Figure: CD14 Endotoxin Receptor; legend: C allele associated, T allele associated, No association]

Simpson A et al. Endotoxin exposure, CD14, and allergic disease: an interaction between genes and the environment. Am J Respir Crit Care Med. 2006;174(4):386-92.

50-60% heritability in twin studies but <2% phenotype explained by current genomics

Slides from Iain Buchan

Page 16

Received Wisdom: Atopic March

• Progression of allergy: Eczema → Asthma → Rhinitis

• Inferred from population summary

• Assumed causal link between eczema – asthma & rhinitis

• Clinical response: target children with eczema to reduce progression to asthma

Spergel & Paller, 2003

World Allergy Organization, 2014

Page 17

Ecologic Fallacy Revealed

Model-based machine learning allowing for transitions between skin, lung and nasal allergies over time

MRC STELAR consortium working at scale across MAAS and ALSPAC cohorts

Belgrave et al. Developmental Profiles of Eczema, Wheeze, and Rhinitis: Two Population-Based Birth Cohort Studies. PLoS Medicine 2014;21;11(10):e1001748.

Page 18

Better Targets for ‘Omics

Disambiguate disease profiles to move toward causal modelling and efficient identification of mechanisms

Belgrave et al. Developmental Profiles of Eczema, Wheeze, and Rhinitis: Two Population-Based Birth Cohort Studies. PLoS Medicine 2014;21;11(10):e1001748.

Page 19

Towards genomic medicine

Data Type | Large-scale Structural Changes | Balanced Translocations | Distant Consanguinity | Uniparental Disomy | Novel / Known Coding Variants | Novel / Known Non-coding Variants
Targeted gene sequencing | ✗ | ✗ | ✗ | ✗ | ✓ | ✗
SNP+ arrays | ✗/✓ | ✗ | ✓ | ✓ | ✗ | ✗
Array CGH* | ✗/✓ | ✗ | ✗ | ✗ | ✗ | ✗
Exome | ✗/✓ | ✗ | ✗/✓ | ✗/✓ | ✓ | ✗
Whole Genome | ✗/✓ | ✓ | ✓ | ✓ | ✓ | ✓

+ Single Nucleotide Polymorphism
* Comparative Genomic Hybridisation

[Figure: log-scale axis from 10,000 to 10,000,000,000 comparing Genotyping; Panels (<10m bases, subset of exons); Exome (10m bases, exons only); Whole genome (3.3bn bases, both exons and introns)]

“Genomics – the changing face of clinical care” Sue Hill, Chief Scientific Officer for England

Page 20

Genomics – accessible data?

• Sequencing 100,000 genomes from patients with cancer and rare diseases
• £24m data infrastructure award from MRC
• Genomics England Clinical Interpretation Partnerships (GeCIPs) to enhance value of data

Page 21

100K Genomes Project

• Sequencing facility at the Sanger Centre
• 30 PB data in a data centre on a military base
• Researchers (GeCIP members) will not be allowed to download raw data files
• Restricted access to data and compute through secure virtual desktop (Inuvika)
• Analysis has to move to the data

But how do we move this to a global scale? How do we analyse across many datasets?

Page 22

Next Genomic Revolution: Scaling down to single cells

Microfluidics sequencing/cytometry

DNA/RNA

Protein

Fluidigm C1

Page 23

Single-cell data

• Existing genomic methods average over a cell population of ~10^7 cells

• Single-cell methods uncover hidden structure:
  – Diverse sub-populations of immune cells
  – Clonal structure within tumours
  – Rare circulating tumour cells from blood
  – Asynchronous cellular dynamics
  – Each cell is now a high-dimensional data point
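A small simulation shows why averaging over a cell population can hide structure. The numbers below are entirely hypothetical: one gene, two hidden sub-populations, and Gaussian expression levels chosen only to make the effect visible.

```python
import random
import statistics

random.seed(1)

# Hypothetical expression of one gene in two hidden sub-populations:
# 80% of cells are "off" (low expression), 20% are "on" (high expression).
cells = [random.gauss(1.0, 0.3) for _ in range(800)]
cells += [random.gauss(10.0, 1.0) for _ in range(200)]

# A bulk assay effectively reports only the population average
bulk_average = statistics.mean(cells)

# Single-cell measurements let us ask how many cells resemble that average
near_average = sum(1 for x in cells if abs(x - bulk_average) < 0.5)

print(f"bulk average: {bulk_average:.2f}")
print(f"cells within 0.5 of the average: {near_average} of {len(cells)}")
```

The bulk value describes almost none of the individual cells, which is the kind of hidden structure that single-cell methods are meant to expose.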

Page 24

Clustering single cell protein data

Amir et al. Nature Biotech. 2013

Page 25

Uncovering clonal evolution in tumours

[Figure: “Life history of the tumor” over time (t0, t1, t2, t3, tsample), from normal cells to a “Poly-clonal tumor at sampling”; tissue volume at time of sampling divided between clones (20%, 15%, 25%, 40%); clonal evolution tree over genotypes A, AB, ABD and ABC]

Florian Markowetz, CRUK Cambridge – from his blog “Scientific B-sides”

Page 26

Circulating tumour cells (CTC) profiling

CTC enrichment via CellSearch
CTC isolation via DepArray

Approach

Targeted:
• Basic CNA to verify CTC status
• Target 1-20 genes
• Use WBCs as –ve controls

Genome Wide:
• Copy number alteration (CNA)
• WES - comprehensive analysis
• Use WBCs as –ve controls

6 SCLC patients chosen with ≥4 single isolated CTCs and CTC pools
CNA data from 6,682 cancer-related protein-coding genes

[Figure: CNA data with TP53 marked; * Pool of 10 CTCs]

Expanded study ongoing, 2000 CTCs from 30 patients

Caroline Dive and Ged Brady, CRUK Manchester Institute

Page 27

Modelling challenge: confounding variation

Stegle et al. Nature Reviews Genetics 2014

Page 28

Single cell data

Last year

Single-cell RNA-Seq
10^3 cells per experiment
10^7 sequence reads per cell
10^4 features extracted per cell

CyTOF protein quantification
10^3 cells per second
10^6 per experiment
30-50 features per cell

This year

Single-cell RNA-Seq
10^6 cells per experiment
10^8 reads per cell
>10^5 features per cell

Single cell multi-omics?

Page 29

What are the pinch points?

• Data volume: cost and transfer speed
• Data analysis: scalable algorithms
• Data quality: batch effects, missing data, missing metadata, concept drift
• Data integration: multi-modal modelling
• Reproducible and robust research

Page 30

Data volume

• Move algorithms to the data
  – Put compute close to local data
  – Commercial cloud (e.g. BaseSpace, Cytobank)
  – Bespoke secure cloud (e.g. 100K genomes project)

• Issues to consider
  – Will your algorithms give the same results?
  – Will the analysis be reproducible in the future?
  – How to integrate across resources?

Page 31

Data analysis

• Scaling up algorithms, e.g. Deep learning libraries integrating CPU/GPU architectures
• Fast approximate methods
• Online/streaming data processing
• Avoid solving compute-intensive intermediate tasks: e.g. avoid genomic alignment prior to counting sub-sequence matches (k-mers)
• Mixed precision numerics
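The k-mer idea combines two of these points: reads are processed as a stream, and sub-sequences are counted directly without first aligning to a reference genome. The following is a minimal sketch; the function name `count_kmers` and the choice to skip windows containing the ambiguous base `N` are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every length-k substring across a stream of reads.

    Reads are consumed one at a time, so the input can be a generator
    over a file that is too large to hold in memory.
    """
    counts: Counter = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip windows containing ambiguous bases
                counts[kmer] += 1
    return counts


# Toy input; real reads would be streamed from a FASTQ file
reads = iter(["ACGTACGT", "CGTAN", "acgt"])
kmer_counts = count_kmers(reads, k=4)
print(kmer_counts.most_common(3))
```

The reads are never all held in memory, but the table of counts still grows with the number of distinct k-mers; at genome scale that is where fast approximate counting structures become relevant.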

Page 32

Data analysis

Methods for Machine Learning no longer simply assessed on predictive accuracy

Page 33

Data quality

Big collected data are typically not designed for a single research question (or any research question)

We need methods to deal with:

Confounders, batch effects, missing data, missing metadata, concept drift, outliers…

(while remaining scalable)

Page 34

Page 35

Robust and reproducible research

Publish data, code, workflows, version numbers, containers…

Results should not depend strongly on arbitrary modelling choices: “shake the model” (Chris Holmes)

“Hypothesis selection” leads to upward significance bias
• Try to break your models
• Use robust models
• Use bootstrapping

Keep track of all hypotheses you have considered
• Store your working history – notebook science
• Publish negative results
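Bootstrapping is one of the simpler ways to see how much a result depends on the particular sample in hand. The sketch below computes a percentile bootstrap confidence interval for a mean using only the standard library; the data are synthetic, and the function name `bootstrap_ci` and its defaults are assumptions for illustration.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement, recompute the statistic, and sort the results
    estimates = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical effect sizes from a small experiment
random.seed(42)
sample = [random.gauss(0.5, 1.0) for _ in range(40)]

point = statistics.mean(sample)
low, high = bootstrap_ci(sample)
print(f"mean = {point:.2f}, 95% bootstrap CI = ({low:.2f}, {high:.2f})")
```

Reporting the interval alongside the point estimate, and fixing the seed so the interval can be regenerated, are small habits in the spirit of the advice on this slide.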

Page 36

Robust and reproducible research

• Build reproducibility into your routine – don’t wait until after your paper is accepted
• Don’t feature here:

Page 37

Conclusion

• Research is increasingly data-driven across all fields – Data Science is now ubiquitous

• New challenges come from the scale, complexity and nature of data:
  Big data – scalable algorithms and architectures
  Complex data – better models: bottom up and top down
  Messy data – statistical thinking is essential
  Human data – ethical dimensions are of key importance
  Accessible data – a valuable common resource