TRANSCRIPT
Microsoft Code of Conduct
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. This
includes all Microsoft events and gatherings, including on digital platforms, where we seek to create a
respectful, friendly, fun and inclusive experience for all participants.
We expect all digital event participants to uphold the principles of this Code of Conduct, which covers the
main digital event and all related activities. We do not tolerate disruptive or disrespectful behavior, messages,
images, or interactions by any participant, in any form, during any aspect of the program, including business
and social activities, regardless of location.
Microsoft will not tolerate harassment or discrimination based on age, ancestry, color, gender identity or
expression, national origin, physical or mental disability, religion, sexual orientation, or any other
characteristic protected by applicable local laws, regulations, and ordinances.
We encourage everyone to assist in creating a welcoming and safe environment. Please report any concerns,
harassing behavior, or suspicious or disruptive activity to the Business Conduct Hotline (1-877-320-MSFT or
[email protected]). Microsoft reserves the right to refuse admittance to or remove any person from
Microsoft Build at any time at its sole discretion.
An introduction to genomics data analysis
on the Azure Cloud
Roberto Lleras
Senior Scientist
Microsoft Genomics
Agenda
How do we fully harness the power
of genomics in the 21st century?
Genomics in the cloud
Challenges in genomics and data science:
how do we put it all together?
Structure of DNA—deoxyribonucleic acid
Double stranded, stable storage material
Each cell contains the entire genome
4-letter alphabet: A, C, G, T
Human genome (23 chromosomes)
is 3.2 billion bases long (and encoded 2x per cell)
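A four-letter alphabet means each base fits in 2 bits. The sketch below is only an illustration of that arithmetic, not a format used by any tool named in this talk; the function names are hypothetical. It packs a sequence into an integer and estimates the raw size of one 3.2-billion-base copy of the genome.

```python
# Hypothetical helpers: 2-bit packing of the A, C, G, T alphabet.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODES[base]
    return value, len(seq)


def unpack(value, length):
    """Recover the DNA string from a packed integer and its length."""
    return "".join(
        BASES[(value >> (2 * i)) & 3] for i in range(length - 1, -1, -1)
    )


def packed_size_bytes(n_bases):
    """Bytes needed to store n_bases at 2 bits per base (rounded up)."""
    return (n_bases * 2 + 7) // 8


seq = "ACTTACGGACGCGAGAGCGGCATTTACCT"
packed, n = pack(seq)
print(unpack(packed, n) == seq)
print(packed_size_bytes(3_200_000_000) / 1e6, "MB per genome copy")
```

At 2 bits per base, one copy of the genome is about 800 MB. Raw sequencing output is far larger because each position is read many times and stored with quality information.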
The future of healthcare (and many other industries)
is powered by genomics
Tailor drugs to patients for
more effective treatment
Design personalized cancer
treatments based on analysis
of tissue from the tumor
Use gene therapy to treat
or prevent disease
Rapidly identify infectious
agents in the environment
and report that information
Determine the cause of
developmental issues in
newborns sooner
Predict inherited disease
A single Illumina NovaSeq 6000
can generate around 300
terabytes of data every year.1
Sequencing a whole human
genome generates around
100 gigabytes of data.
By 2025, it is estimated that as
much as 40 exabytes of storage
capacity will be needed for human
genomic data.2
1 Illumina
2 Z. D. Stephens et al., PLOS Biology, 2015
Analysis of a whole human
genome requires hundreds of
core-hours of compute time.
It would require over 9 million
core-hours to analyze the
genomic data of everyone in
New York’s Madison Square Garden.
Integrating genomic data into
Data Lakes for AI + ML data
science requires specialized
knowledge and tooling.
Data
Compute
Genomics at scale: a storage and compute challenge
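The Madison Square Garden figure can be approximated with a back-of-envelope calculation. The sketch below is illustrative only: the 100 GB per genome comes from the slide, while the attendee count (about 20,000) and the 450 core-hours per genome are assumptions chosen to be consistent with "hundreds of core-hours" and are not stated in the talk.

```python
GB_PER_GENOME = 100            # from the slide
CORE_HOURS_PER_GENOME = 450    # assumption: "hundreds of core-hours"
ATTENDEES = 20_000             # assumption: approximate arena capacity


def total_storage_gb(people, gb_per_genome=GB_PER_GENOME):
    """Raw storage needed to hold one whole genome per person."""
    return people * gb_per_genome


def total_core_hours(people, per_genome=CORE_HOURS_PER_GENOME):
    """Compute needed to analyze one whole genome per person."""
    return people * per_genome


storage = total_storage_gb(ATTENDEES)
compute = total_core_hours(ATTENDEES)
print(f"storage: {storage / 1e6:.1f} PB")
print(f"compute: {compute / 1e6:.1f} million core-hours")
# Wall-clock time if the work were spread over a hypothetical 1,000 cores.
print(f"on 1,000 cores: {compute / 1_000 / 24:.0f} days")
```

Under these assumptions the arena alone needs about 2 PB of storage and 9 million core-hours, which is over a year of wall-clock time on 1,000 cores.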
Genomics (+AI and ML) in the cloud
Sequencing data
>CTACGGT
ACTTACGGACGCGAGAGCGGCATTTACCT
Bulk data transfer and storage
(100s of GB per sample)
Sequence alignment
(compute-intensive problem per sample)
Sample metadata + more
Reference: ATTACGGATTACCATGGGCATTTA
Sample:    ATTACGGATTGCCATGGGCATTTA
Sample:    ATTACGGATTGCCATGGGCATTTA
Sample:    ATTACGGATTGCCATGGTCATTTA
Variant calling and filtration
(Train ML models to more accurately spot mutations)
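The reference and sample sequences above can be compared position by position to list the differences. This is a minimal sketch of the idea only, assuming equal-length, already-aligned sequences; real variant callers work from many noisy reads and are far more involved. The function name is hypothetical.

```python
def call_snvs(reference, sample):
    """Return (1-based position, reference base, sample base) per mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sketch handles equal-length, pre-aligned sequences only")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


reference = "ATTACGGATTACCATGGGCATTTA"
samples = [
    "ATTACGGATTGCCATGGGCATTTA",
    "ATTACGGATTGCCATGGGCATTTA",
    "ATTACGGATTGCCATGGTCATTTA",
]

for sample in samples:
    print(call_snvs(reference, sample))
# [(11, 'A', 'G')]
# [(11, 'A', 'G')]
# [(11, 'A', 'G'), (18, 'G', 'T')]
```

All three samples share the A to G change at position 11; only the third also carries the G to T change at position 18.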
Healthy Individuals
Affected Individuals
Variant interpretation
(Integrate metadata + variants to assess impact
of mutations)
Clinical data
Socioeconomic data
Image data
Known variants
Patient histories
Disease research
Reference databases
Structure and store
other datatypes for easy access
Human interface and hypothesis testing
(Assess model performance and determine experimental results)
Human interpretation
Genomics (+AI and ML) in the cloud
Sequencing data Sample metadata + more
Clinical data
Socioeconomic data
Image data
Reference databases
Human interpretation
Human interface research
Data security and governance
Machine learning
Model generation
Statistical analyses
Data science research
Data structures
Machine vision
Natural language processing
Algorithms development
Artificial intelligence
Hardware acceleration
Algorithms development
Hardware acceleration
Bioinformatics
Data transfer
Compression
Archiving
Data sharing
Genomics (+AI and ML) in the cloud
Sequencing data Sample metadata + more
Clinical data
Socioeconomic data
Image data
Reference databases
Human interpretation
Specialized hardware (FPGAs, GPUs, TPUs)
Data sources: Genomics/Omics (structured), Medical Imaging (unstructured), EMR (unstructured), Business apps (structured), Notes (unstructured), Genomics/Omics (unstructured)
Pipeline stages: INGEST, PREP, MODEL & TRAIN, VISUALIZE & SHARE
Services: HDInsight, Power BI, Machine Learning, Data Lake Analytics, Cognitive Services, SharePoint, Data Factory, Data Store, Stream Analytics, Azure Databricks, SQL Data Warehouse, Teams
Insights through multimodal data analytics pipelines
The newspaper problem
Example from www.bioinformaticsalgorithms.com
The newspaper problem as an overlapping puzzle
Example from www.bioinformaticsalgorithms.com
Multiple unsequenced copies
Randomly fragment the genome
Resulting overlapping reads:
the higher the coverage, the better the quality
Apply bioinformatics tools
to reassemble the reads
Unordered sequenced segments (reads),
2–3 billion reads
Requires a large amount of computing power and storage capacity
From experimental to computational challenges
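The overlapping-puzzle idea can be sketched with a toy greedy assembler: repeatedly merge the two reads with the longest suffix-to-prefix overlap. This is an illustration under simplifying assumptions (error-free reads, no repeats, a single strand), not the method used by production assemblers or by the pipelines in this talk; the function names are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads by largest overlap until no overlaps remain; return contigs."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to merge
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


genome = "ATTACGGATTGCCATGGTCATTTA"
# Unordered 10-base reads taken at different offsets of the toy genome.
reads = [genome[i:i + 10] for i in (10, 0, 14, 5)]
contigs = greedy_assemble(reads)
print(contigs)
```

On this toy input the four reads collapse back into the original 24-base sequence. The all-pairs comparison is quadratic in the number of reads, which hints at why billions of reads demand so much compute.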
Secondary analysis pipeline
Best practices pipeline recommended by the Broad Institute of MIT and Harvard
Microsoft Genomics—optimized service on Azure
          Input size   Compute time
Average   43 GB        5 h
Largest   398 GB       52 h
Average compute time 30 h
Accelerate precision medicine with Microsoft Genomics. Microsoft, 2018
The world of genomics is rapidly evolving
An explosion of
new technologies…
Long read sequencing
Single cell sequencing
Spatial sequencing
Optical mapping,
chromosome capture
Increased accessibility and accelerating innovation in a connected world demand a rapidly scalable, elastic platform for conducting research and deploying at scale
Cromwell on Azure
Azure implementation of the Broad Institute’s Cromwell workflow engine using the GA4GH Task
Execution Service (TES) backend
Free OSS solution on GitHub made available under the MIT license
Easy to install, configure, and use
Leverages Azure Batch compute and Blob storage for near-infinite scalability
Support for authenticated access in a workgroup setting
Best Practice Pipelines supported on Cromwell—BWA/GATK, MuTect2, RNA-Seq, ATAC-Seq, etc.
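To give a feel for what a TES backend exchanges, the sketch below builds a task message using field names from the GA4GH Task Execution Service schema (name, executors, inputs, outputs, resources). It is an illustration only: the container image, command, storage URLs, and resource sizes are hypothetical placeholders, and in practice Cromwell generates these tasks from a workflow definition rather than users writing them by hand.

```python
import json


def make_tes_task(name, image, command, input_url, output_url):
    """Build a dict shaped like a GA4GH TES task (illustrative values only)."""
    return {
        "name": name,
        "executors": [
            {
                "image": image,        # container image that runs the step
                "command": command,    # argv-style list of strings
                "stdout": "/data/output.sam",
            }
        ],
        "inputs": [
            {"url": input_url, "path": "/data/input.fastq", "type": "FILE"}
        ],
        "outputs": [
            {"url": output_url, "path": "/data/output.sam", "type": "FILE"}
        ],
        # Hypothetical sizing; real values depend on the workflow step.
        "resources": {"cpu_cores": 16, "ram_gb": 64.0, "disk_gb": 500.0},
    }


task = make_tes_task(
    name="align-sample-001",
    image="example.invalid/bwa:latest",                       # placeholder
    command=["bwa", "mem", "/data/ref.fa", "/data/input.fastq"],
    input_url="https://example.invalid/inputs/sample.fastq",  # placeholder
    output_url="https://example.invalid/outputs/sample.sam",  # placeholder
)
print(json.dumps(task, indent=2))
```

Each task describes one containerized step with its files and resource needs, which is what lets a batch compute backend schedule many steps independently.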
Cromwell on Azure (Genomics Workflow Orchestration)
Azure Services (Storage, Compute, DB, ML, PBI)
Secondary Analysis Tertiary Analysis Presentation
Automated genomics + data science and ML pipeline
Visualize scientific discovery in real time
Q & A
What next?
Microsoft student resources can be found at the GitHub repository for
further learning opportunities.
aka.ms/StudentsAtBuild
Microsoft Learn for Students is the place to develop practical skills through
fun, interactive modules and paths. Plus, educators can get access to Microsoft
classroom materials and curriculum. Find it all at: aka.ms/learnforstudents
Azure for Students gives you $100 in credit on the Azure Cloud. Build your
skills in trending tech including data science, artificial intelligence (AI),
machine learning, and other areas with access to professional developer tools.
Start here: aka.ms/azureforstudents
Imagine Cup is more than just a competition—you can work with friends (and
make new ones), network with professionals, gain new skills, make a difference
in the world around you, and get the chance to win cash and cloud credits.
To find out more: Higher education students: imaginecup.com/. Educators of
students ages 13–18 start with Imagine Cup Jr.
Microsoft Student Learn Ambassadors are a global group of campus leaders who
are eager to help fellow students, lead in their local tech community, and develop
technical and career skills for the future.
Learn more at: studentambassadors.microsoft.com/
Interested in genomics?
Bioinformatics algorithms
bioinformaticsalgorithms.com
Tools
Cromwell on Azure: https://github.com/microsoft/CromwellOnAzure
Microsoft Genomics: https://www.microsoft.com/en-us/genomics
Microsoft genomics service architecture
“msgen” CLI
Customer
Blob Storage
Azure Portal
API Management
Primary DB
App Insights
Monitoring
Azure AD
Genomics Admin
Data Warehouse
Microsoft Genomics Region Boundary
REST API
Web App
Resource Provider
Admin Portal
Data Factory
Internal Reference Data
Logs
Azure Batch VM Pools
© Copyright Microsoft Corporation. All rights reserved.