Lecture 15: Course Conclusion


Page 1: Lecture 15: Course Conclusion


Lecture 15: Course Conclusion

Page 2: Lecture 15: Course Conclusion


Announcements
● TA office hours will continue to be project advising sessions during this week
○ Sign up on the spreadsheet (see Ed announcement)
○ Attendance is worth 5% of the project grade
● Final Project Poster Session: Thu 12/9, 12:15-3:15pm
● Final Project Report due Fri 12/10, 11:59pm

Page 3: Lecture 15: Course Conclusion


This course: foundations of AI in healthcare

Page 4: Lecture 15: Course Conclusion


This course: foundations of AI in healthcare

Page 5: Lecture 15: Course Conclusion


Convergence of key ingredients of deep learning: algorithms, compute, and data

Page 6: Lecture 15: Course Conclusion


Different classes of neural networks
● Fully connected neural networks (linear layers; good for “feature vector” inputs)
● Convolutional neural networks (convolutional layers; good for image inputs)
● Recurrent neural networks (linear layers modeling a recurrence relation across a sequence; good for sequence inputs, mapping an input sequence to an output sequence)
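As a rough illustration, here is how the three layer types look in PyTorch (a minimal sketch; all sizes are illustrative, and PyTorch itself is just one possible framework):

```python
import torch
import torch.nn as nn

x_vec = torch.randn(8, 32)          # batch of 8 feature vectors, 32-dim
x_img = torch.randn(8, 3, 64, 64)   # batch of 8 RGB images
x_seq = torch.randn(8, 20, 32)      # batch of 8 sequences, 20 steps, 32-dim each

fc = nn.Linear(32, 16)                                           # fully connected layer
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)  # convolutional layer
rnn = nn.RNN(input_size=32, hidden_size=16, batch_first=True)    # recurrent layer

print(fc(x_vec).shape)    # torch.Size([8, 16])
print(conv(x_img).shape)  # torch.Size([8, 16, 62, 62])
out, h = rnn(x_seq)       # output at every timestep, plus final hidden state
print(out.shape)          # torch.Size([8, 20, 16])
```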

Page 7: Lecture 15: Course Conclusion


Two-layer fully-connected neural network

Neural network parameters: $W_1, W_2$

Output: $\hat{y} = W_2\,\sigma(W_1 x)$, where $\sigma$ is an elementwise nonlinearity

Loss function (regression loss, same as before):
Per-example: $L_i = (\hat{y}_i - y_i)^2$
Over M examples: $L = \frac{1}{M}\sum_{i=1}^{M} L_i$

Gradient of loss w.r.t. weights: the function is now more complex, so it is much harder to derive the expressions by hand. Instead: computational graphs and backpropagation.
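A minimal sketch of this, assuming a ReLU nonlinearity and letting PyTorch's autograd stand in for the hand-derived gradients (sizes are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 32),   # W1 (plus bias)
    nn.ReLU(),           # nonlinearity sigma
    nn.Linear(32, 1),    # W2 (plus bias)
)
x = torch.randn(16, 10)  # M = 16 examples
y = torch.randn(16, 1)   # continuous regression targets

y_hat = model(x)                   # forward pass builds the computational graph
loss = ((y_hat - y) ** 2).mean()   # squared error, averaged over M examples
loss.backward()                    # backpropagation fills in .grad for every parameter
print(model[0].weight.grad.shape)  # gradient of the loss w.r.t. W1
```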

Page 8: Lecture 15: Course Conclusion


ResNet [He et al., 2015]

Residual block: two 3x3 conv layers with a ReLU in between; the block outputs F(x) + x, i.e., the learned residual F(x) plus the identity shortcut x, followed by a final ReLU.

Full ResNet architecture:
- Stack residual blocks
- Every residual block has two 3x3 conv layers
- Periodically, double the # of filters and downsample spatially using stride 2 (/2 in each dimension), e.g., 3x3 conv, 64 -> 3x3 conv, 128, /2 -> ... -> 3x3 conv, 512
- Additional conv layer at the beginning (7x7 conv, 64, /2, then pooling)
- No FC layers besides FC 1000 to output classes (input -> conv/pool -> residual blocks -> pool -> FC 1000 -> softmax)

Slide credit: CS231n
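A minimal PyTorch sketch of the basic residual block described above (batch-norm placement follows the common "basic block" pattern; the channel count is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convs plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))  # first 3x3 conv
        out = self.bn2(self.conv2(out))        # second 3x3 conv: F(x)
        return F.relu(out + x)                 # F(x) + x, then ReLU

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```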

Page 9: Lecture 15: Course Conclusion


Common loss functions

Regression: label is a continuous value. Minimize the squared difference between the prediction output and the target.

Binary cross-entropy (BCE): label is binary in {0,1}. The prediction is a real number in (0,1) giving the probability of the label being 1, and is usually the output of a sigmoid operation after the final layer. The loss is equivalent to the negative log of the predicted probability of the correct ground-truth class. Think about what the expression looks like when y_i = 1 vs. 0.

Softmax: label is 1 of K classes in {0, …, K-1}. Extension of binary cross-entropy loss to multiple classes. s_j corresponds to the score (e.g., output of the final layer) for each class; the fraction inside the log provides a normalized probability for each class. The loss is the negative log of the probability of the true class y_i, as with the BCE loss.

SVM: label is 1 of K classes in {0, …, K-1}. Same use case as softmax, but a different way of encouraging the model to produce outputs that we “like”: it incurs the lowest loss of 0 (what we want) if the score for the true class y_i is greater than the score for each incorrect class j by a margin of 1. In practice, softmax is more popular and provides a nice probabilistic interpretation.
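All four losses are available in PyTorch; a small sketch with toy values (the library calls are standard, but the numbers are illustrative):

```python
import torch
import torch.nn.functional as F

# Regression: mean squared error
pred, target = torch.tensor([2.5]), torch.tensor([3.0])
mse = F.mse_loss(pred, target)

# Binary cross-entropy: logit -> sigmoid -> probability of label 1
logit, y = torch.tensor([0.8]), torch.tensor([1.0])
bce = F.binary_cross_entropy_with_logits(logit, y)  # = -log(sigmoid(0.8)) here

# Softmax cross-entropy over K classes: negative log-probability of the true class
scores = torch.tensor([[1.0, 2.0, 0.5]])  # s_j for K = 3 classes
y_true = torch.tensor([1])
ce = F.cross_entropy(scores, y_true)       # = -log(softmax(scores)[0, 1])

# Multiclass SVM (hinge) loss with margin 1
hinge = F.multi_margin_loss(scores, y_true, margin=1.0)

print(mse.item(), bce.item(), ce.item(), hinge.item())
```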

Page 10: Lecture 15: Course Conclusion


Evaluation metrics

- Receiver Operating Characteristic (ROC) curve:
- Plots sensitivity against 1 - specificity, i.e., true positive rate (TPR) on the y-axis vs. false positive rate (FPR) on the x-axis, as the prediction threshold is varied
- Gives the trade-off between sensitivity and specificity
- Also report the summary statistic AUC (area under the curve)
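A minimal sketch using scikit-learn (toy labels and scores, for illustration only):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5])  # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)               # summary statistic: area under the curve
print(f"AUC = {auc:.3f}")
```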

Page 11: Lecture 15: Course Conclusion


Ciompi et al. 2015

Ciompi et al. Automatic classification of pulmonary peri-fissural nodules in computed tomography using an ensemble of 2D views and a convolutional neural network out-of-the-box. Medical Image Analysis, 2015.

- Task: classification of lung nodules in 3D CT scans as peri-fissural nodules (PFN, likely to be benign) or not

- Dataset: 568 nodules from 1729 scans at a single institution. (65 typical PFNs, 19 atypical PFNs, 484 non-PFNs).

- Data pre-processing: prescaling from CT Hounsfield units (HU) into [0,255]. Replicate 3x across R,G,B channels to match the input dimensions of ImageNet-trained CNNs.
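A sketch of that pre-processing step (the HU window bounds below are assumptions for illustration, not the paper's exact values):

```python
import numpy as np

hu_slice = np.random.uniform(-1000, 400, size=(224, 224))  # stands in for a CT slice in HU
lo, hi = -1000.0, 400.0                                    # assumed HU window

# Prescale HU into [0, 255], then replicate across 3 channels for an ImageNet-trained CNN.
scaled = np.clip((hu_slice - lo) / (hi - lo) * 255.0, 0, 255).astype(np.uint8)
rgb = np.stack([scaled] * 3, axis=-1)
print(rgb.shape)  # (224, 224, 3)
```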

Page 12: Lecture 15: Course Conclusion


Gulshan et al. 2016
- Dataset:
- 128,175 images, each graded by 3-7 ophthalmologists.
- 54 total graders, each paid to grade between 20 and 62,508 images.
- Data preprocessing:
- Circular mask of each image was detected and rescaled to be 299 pixels wide
- Model:
- Inception-v3 CNN, with ImageNet pre-training
- Multiple BCE losses corresponding to different binary prediction problems, which were then used for the final determination of referable diabetic retinopathy
- Graders provided finer-grained labels, which were then consolidated into (easier) binary prediction problems

Gulshan, et al. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA, 2016.

Page 13: Lecture 15: Course Conclusion


Richer visual recognition tasks: segmentation and detection

Figures: Chen et al. 2016. https://arxiv.org/pdf/1604.02677.pdf

Classification: output is one category label for the image (e.g., colorectal glands)

Semantic segmentation: output is a category label for each pixel in the image

Detection: output is a spatial bounding box for each instance of a category object in the image

Instance segmentation: output is a category label and instance label for each pixel in the image; distinguishes between different instances of an object

Page 14: Lecture 15: Course Conclusion


Lung nodule segmentation
- E.g., Liu et al. 2018
- Dataset: Lung Nodule Analysis (LUNA) challenge, 888 512x512 CT scans from the Lung Image Database Consortium (LIDC-IDRI).
- Performed 2D instance segmentation on 2D CT slices

Liu et al. Segmentation of Lung Nodule in CT Images Based on Mask R-CNN. 2018.

We will see other ways to handle 3D medical data types in the next lecture

Page 15: Lecture 15: Course Conclusion


Example: instance segmentation of cell nuclei

Page 16: Lecture 15: Course Conclusion


3D convolutions

Figure credit: https://www.researchgate.net/profile/Deepak_Mishra19/publication/330912338/figure/fig1/AS:723363244810254@1549474645742/Basic-3D-CNN-architecture-the-3D-filter-is-convolved-with-the-video-in-three-dimensions.png

Slide the filter along 3 directions: x, y, and z! Here x, y, z are spatial and/or temporal dimensions; the filter (e.g., a 5 x 5 x 3 x 10 filter) goes all the way through the “channels” dimension (e.g., R,G,B) as before.

When might you use 3D convolutions?
- Ex: 224 x 224 x 1 x 256 3D CT scan (with 256 slices)
- Ex: 224 x 224 x 3 x 500 video data (with 500 temporal frames)
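A minimal PyTorch sketch of a 3D convolution over a CT volume (sizes match the CT example above; the cubic 5x5x5 kernel is illustrative):

```python
import torch
import torch.nn as nn

# A 3D CT volume: batch of 1, 1 channel, 256 slices of 224 x 224 (depth, height, width).
ct = torch.randn(1, 1, 256, 224, 224)

# The 3D filter slides along x, y, and z, and spans the full channel dimension as in 2D.
conv3d = nn.Conv3d(in_channels=1, out_channels=10, kernel_size=5)
print(conv3d(ct).shape)  # torch.Size([1, 10, 252, 220, 220])
```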

Page 17: Lecture 15: Course Conclusion


I3D: 3D convolutional network for video data
- Uses an Inception Module (Inc.) with 3D convolutions: a 3D version of the Inception module from the Inception network (also known as GoogLeNet)
- Can pre-train from 2D datasets, e.g., ImageNet, by replicating and normalizing the 2D weights over the additional dimension!
- Note: in general, many 2D architectures can be “3D-ified”!

Carreira and Zisserman. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. CVPR 2017.

Page 18: Lecture 15: Course Conclusion


For richer visual recognition tasks, can also extend respective CNN architectures to use 3D convolutions

Figures: Chen et al. 2016. https://arxiv.org/pdf/1604.02677.pdf

Classification: output is one category label for the image (e.g., colorectal glands)

Semantic segmentation: output is a category label for each pixel in the image

Detection: output is a spatial bounding box for each instance of a category object in the image

Instance segmentation: output is a category label and instance label for each pixel in the image

Page 19: Lecture 15: Course Conclusion


E.g., 3D U-Net. Ex: 3D segmentation of Xenopus kidney in confocal microscopic data
- Spatial dims: ~250 x 250 x 60. 3 channels: each channel corresponds to a different type of data capture
- Used only 3 samples total (with a total of 77 annotated 2D slices)! Leverages the fact that each sample contains many instances of the same repetitive structures, with variation.

Cicek et al. 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation. MICCAI 2016.

Page 20: Lecture 15: Course Conclusion


What are electronic health records?

Figure credit: Rajkomar et al. 2018

Patient chart in digital form, containing medical and treatment history

Medical imaging and lab test results and reports

Page 21: Lecture 15: Course Conclusion


A real example of EHR data: MIMIC-III dataset

Johnson et al. MIMIC-III, a freely accessible critical care database. 2016.

Page 22: Lecture 15: Course Conclusion


CPT (Current Procedural Terminology): codes for procedures and services

Johnson et al. MIMIC-III, a freely accessible critical care database. 2016.
Additional figure credit: https://d20ohkaloyme4g.cloudfront.net/img/document_thumbnails/e570ad571499b88c8814e7366594e9bd/thumb_1200_1553.png

Page 23: Lecture 15: Course Conclusion


(Vanilla) Recurrent Neural Network

The RNN maps an input x to an output y through a state consisting of a single “hidden” vector h, updated by fully connected layers:

$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$
$y_t = W_{hy} h_t$

Slide credit: CS231n

Page 24: Lecture 15: Course Conclusion


RNN: Computational Graph: Many to Many

The same function $f_W$ (with shared weights W) is applied at every timestep: $h_t = f_W(h_{t-1}, x_t)$. Each hidden state produces an output $y_t$ with a per-timestep loss $L_t$, and the total loss is $L = \sum_t L_t$.

Slide credit: CS231n
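A minimal NumPy sketch of this unrolled recurrence (weights and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 4                              # input dim, hidden dim
W_xh = rng.normal(scale=0.1, size=(H, D))
W_hh = rng.normal(scale=0.1, size=(H, H))
W_hy = rng.normal(scale=0.1, size=(1, H))

xs = rng.normal(size=(5, D))             # a sequence of 5 input vectors
h = np.zeros(H)                          # initial hidden state h_0
for x_t in xs:                           # the same weights W are reused at every timestep
    h = np.tanh(W_hh @ h + W_xh @ x_t)   # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
    y_t = W_hy @ h                       # per-timestep output y_t
    print(y_t)
```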

Page 25: Lecture 15: Course Conclusion


Harutyunyan et al.: phenotypes
- Input: time-series data corresponding to an entire ICU stay
- Output: multilabel classification of the presence of 25 acute care conditions (merged from ICD codes) in the stay record

Q: Why do we formulate this as a multi-label classification task?
A: Comorbidities (co-occurring conditions)

Q: What loss function should we use?
A: Multiple binary cross-entropy losses

Figure credit: Harutyunyan et al. Multitask learning and benchmarking with clinical time series data. 2019.
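A minimal sketch of the multi-label loss (batch size and labels are illustrative):

```python
import torch
import torch.nn as nn

num_conditions = 25
logits = torch.randn(16, num_conditions)                     # model outputs for 16 stays
labels = torch.randint(0, 2, (16, num_conditions)).float()   # each stay can have several conditions

# BCEWithLogitsLoss applies a sigmoid per condition, so the 25 predictions
# are independent binary problems rather than one softmax over classes.
loss = nn.BCEWithLogitsLoss()(logits, labels)
print(loss.item())
```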

Page 26: Lecture 15: Course Conclusion


OMOP Common Data Model

Figure credit: https://ohdsi.github.io/TheBookOfOhdsi/images/CommonDataModel/cdmDiagram.png

Page 27: Lecture 15: Course Conclusion


FHIR

Figure credit: Choi et al. OHDSI on FHIR Platform Development with OMOP CDM mapping to FHIR Resources. 2016.

Data from all sources can be written into an OMOP data repository for analysis

Page 28: Lecture 15: Course Conclusion


Data representation

Raw data as FHIR resources

Rajkomar et al. Scalable and accurate deep learning with electronic health records. Npj Digital Medicine, 2018.

Page 29: Lecture 15: Course Conclusion


Token embeddings

A 1xN token input (a one-hot selection of a token, e.g., [0 0 1 0 0 0 0 …. 0]) multiplied by an N x D embedding matrix selects one row: a D-dim token embedding, e.g., X = [0.5 0.8 0.2].

In general, learned embedding matrices are a useful way to map discrete data into a semantically meaningful, continuous space! We will see them frequently in natural language processing.
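A minimal PyTorch sketch of the lookup (vocabulary size and dimensions are illustrative):

```python
import torch
import torch.nn as nn

N, D = 1000, 3                   # vocabulary size N, embedding dim D
embedding = nn.Embedding(N, D)   # learnable N x D embedding matrix

token_id = torch.tensor([2])     # equivalent to a one-hot vector with a 1 at index 2
x = embedding(token_id)          # row lookup: the D-dim embedding for this token
print(x.shape)                   # torch.Size([1, 3])

# Equivalent view: multiplying a 1xN one-hot vector by the N x D matrix.
one_hot = torch.zeros(1, N)
one_hot[0, 2] = 1.0
x_alt = one_hot @ embedding.weight
print(torch.allclose(x, x_alt))  # True
```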

Page 30: Lecture 15: Course Conclusion


Word embeddings

Same mechanism as token embeddings: a 1xN one-hot input selects a row of an N x D embedding matrix, yielding a D-dim embedding.

Words come from a discrete vocabulary! Can learn word embeddings using a similar framework.

Page 31: Lecture 15: Course Conclusion


Skip-gram model

Take the word embedding (feature vector) $h_t = E x_t$ of the word at the t-th position, and use it to predict the word identity of a set of neighboring positions $x_{t-2}, x_{t-1}, x_{t+1}, x_{t+2}$, with corresponding losses $L_{t-2}, L_{t-1}, L_{t+1}, L_{t+2}$. (Each is an N-way classification if the dictionary has N words.)

Can train using a classification loss (e.g., softmax loss) based only on the text structure, without any external labels!

Captures the notion that words occurring in similar contexts should have similar feature vectors (word embeddings).

Aside: trying to learn “good” feature representations using loss functions based on inherent structure in the data, as opposed to external labels, is a currently active area of research called “self-supervised learning”.

Mikolov et al. Efficient Estimation of Word Representations in Vector Space, 2013.
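A toy training sketch in this spirit (random word ids stand in for real text, and sizes are illustrative; Mikolov et al.'s actual formulation adds tricks such as negative sampling):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, D = 50, 16                         # vocabulary size, embedding dim
embed = nn.Embedding(N, D)            # center-word embedding matrix E
out = nn.Linear(D, N)                 # predicts a distribution over the vocabulary
opt = torch.optim.Adam(list(embed.parameters()) + list(out.parameters()), lr=1e-2)

corpus = torch.randint(0, N, (100,))  # toy "text": a sequence of 100 word ids
window = 2
for t in range(window, len(corpus) - window):
    center = corpus[t].unsqueeze(0)
    h_t = embed(center)               # embedding of the word at position t
    loss = 0.0
    for offset in (-2, -1, 1, 2):     # predict each neighboring word's identity
        target = corpus[t + offset].unsqueeze(0)
        loss = loss + F.cross_entropy(out(h_t), target)
    opt.zero_grad(); loss.backward(); opt.step()
```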

Page 32: Lecture 15: Course Conclusion


Transformer architecture framework
- Recent approach for sequence processing based on “self-attention” (Vaswani et al. 2017). BERT uses an encoder stack of “encoder layers”, each consisting of encoder self-attention followed by a feed-forward sublayer (the original Transformer also had decoder layers). The input is a token sequence, e.g., “abnormal findings lung...”.

Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2018.
Vaswani et al. Attention is All You Need, 2017.

Page 33: Lecture 15: Course Conclusion


Training BERT
1. Predict randomly masked words in sentence inputs (classification). Input sequences begin with a start token ([CLS]), and masked words are replaced with a [MASK] token.
2. Input sentence pairs separated by a [SEP] token; predict whether the 2nd sentence follows the 1st in the text.

Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2018.
Vaswani et al. Attention is All You Need, 2017.

Page 34: Lecture 15: Course Conclusion


ClinicalBERT: training on clinical notes (from MIMIC)

Fine-tuning ClinicalBERT for prediction of 30-day hospital readmission:
- Use the hidden state corresponding to the [CLS] token
- When performing prediction from long sequences, obtain predictions for each sentence separately and then combine them

Huang et al. ClinicalBert: Modeling Clinical Notes and Predicting Hospital Readmission, 2019.
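A minimal sketch of [CLS]-based fine-tuning with the Hugging Face transformers API (the generic bert-base-uncased checkpoint and the linear head are illustrative stand-ins, not the ClinicalBERT weights):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
readmission_head = torch.nn.Linear(bert.config.hidden_size, 1)

inputs = tokenizer("Patient admitted with abnormal findings in lung ...", return_tensors="pt")
outputs = bert(**inputs)
cls_hidden = outputs.last_hidden_state[:, 0]  # hidden state of the [CLS] token
logit = readmission_head(cls_hidden)          # binary readmission prediction (train with BCE)
print(torch.sigmoid(logit))
```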

Page 35: Lecture 15: Course Conclusion


Some biology basics: starting from DNA

Figure credit: virtualmedicalcentre.com
Figure credit: https://en.wikipedia.org/wiki/Nucleobase#/media/File:DNA_chemical_structure.svg

Page 36: Lecture 15: Course Conclusion


Transcription and translation

Figure credit: https://www.cancer.gov/images/cdr/live/CDR761782-571.jpg

Transcription: DNA -> RNA

Translation: RNA -> Protein

Page 37: Lecture 15: Course Conclusion


Many data types, e.g. RNA-seq

Produces readout of mRNA content in a tissue sample

Figure credit: https://cdn.technologynetworks.com/tn/images/body/dnasequencinga1529596208892.png

Map back to reference genome for analysis

Now the standard approach for transcriptomics studies

More recently, in the 2010s: single-cell RNA-seq!

Page 38: Lecture 15: Course Conclusion


ENCODE: identifying and analyzing all functional elements in the human genome

Figure credit: https://www.encodeproject.org/

- Launched by US National Human Genome Research Institute in 2003

- Contributions from worldwide consortium of research groups

Page 39: Lecture 15: Course Conclusion


DeepSEA

Predicts chromatin effects of (non-coding) sequence alterations with single-nucleotide sensitivity (SNPs: single-nucleotide polymorphisms)

Input: DNA sequence pair with SNP
Output: predicted chromatin effects (919 total)
- 690 transcription factor profiles
- 125 DNase I hypersensitive site (DHS) profiles (looser chromatin structure, easier protein binding)
- 104 histone-mark profiles (histone modifications)

Multi-task training! Multi-task prediction of 919 chromatin profiles, for each allele (variant)

Zhou and Troyanskaya. Predicting effects of noncoding variants with deep learning–based sequence model. Nature Methods, 2015.
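A minimal multi-task sketch in this spirit (the layer sizes below are illustrative, not DeepSEA's published architecture): one shared trunk over one-hot DNA, with 919 jointly trained binary outputs.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv1d(4, 64, kernel_size=8),  # input: one-hot DNA (A,C,G,T) over a 1000-bp window
    nn.ReLU(),
    nn.AdaptiveMaxPool1d(1),
    nn.Flatten(),
    nn.Linear(64, 919),               # one logit per chromatin profile
)
seq = torch.randn(2, 4, 1000)         # batch of 2 one-hot-encoded sequences (toy values)
labels = torch.randint(0, 2, (2, 919)).float()
loss = nn.BCEWithLogitsLoss()(model(seq), labels)  # 919 binary tasks trained jointly
print(loss.item())
```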

Page 40: Lecture 15: Course Conclusion


Multimodal data
Can be very similar, e.g., different image acquisition variants

Figure credit: Dong et al. MIUA, 2017.

Page 41: Lecture 15: Course Conclusion


Multimodal data
Or very different, e.g., different types of clinical data

Figure credit: Rajkomar et al. 2018.

Page 42: Lecture 15: Course Conclusion


Categorizations of multimodal models

Joint fusion: both modality-specific components (with learnable parameters) and combined-modality components within the model, all of which are updated during model training

Huang et al. Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines, 2020.
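A minimal joint-fusion sketch (hypothetical encoders and sizes): because the modality-specific encoders and the combined head live in one model, a single backward pass updates all of them together.

```python
import torch
import torch.nn as nn

class JointFusionModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Modality-specific components (learnable)
        self.image_encoder = nn.Sequential(nn.Conv2d(1, 8, 3), nn.ReLU(),
                                           nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.ehr_encoder = nn.Sequential(nn.Linear(20, 8), nn.ReLU())
        # Combined-modality component
        self.head = nn.Linear(8 + 8, 1)

    def forward(self, image, ehr):
        fused = torch.cat([self.image_encoder(image), self.ehr_encoder(ehr)], dim=1)
        return self.head(fused)

model = JointFusionModel()
print(model(torch.randn(4, 1, 64, 64), torch.randn(4, 20)).shape)  # torch.Size([4, 1])
```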

Page 43: Lecture 15: Course Conclusion


How can we produce good labels from noisy sources? More sophisticated approach: learn models for how to best aggregate noisy labeling functions!

Dunnmon et al. Cross-Modal Data Programming Enables Rapid Medical Machine Learning, 2020.
Figure credit: Nishith Khandwala et al., 2017.

Page 44: Lecture 15: Course Conclusion


AI and COVID-19
- Detection of COVID-19 from CT images
- 2-stage process: lung segmentation followed by classification as COVID-19 or not
- Multinational dataset of 2724 scans from 2617 patients, with 1029 scans (922 patients) confirmed positive for COVID-19

Harmon et al. Artificial intelligence for the detection of COVID-19 pneumonia on chest CT using multinational datasets, 2020.

Page 45: Lecture 15: Course Conclusion
Page 46: Lecture 15: Course Conclusion


Other paradigms of machine learning: unsupervised learning

Data: just data x, no labels!

Goal: learn some underlying hidden structure of the data

Examples: clustering, representation / feature learning, density estimation, etc.

Representation learning: an encoder maps input data to features, trained with an unsupervised training objective

Page 47: Lecture 15: Course Conclusion


Darabi et al. 2019
- Autoencoder-based unsupervised representation learning for multimodal data of 200,000 records from 250 hospital sites (eICU Collaborative Research Database)
- One autoencoder for each code-based modality (e.g., medication, treatment, diagnosis) and for signal time-series (e.g., heart rate)
- Used the feature representations to train models for downstream mortality and readmission prediction tasks

Darabi et al. Unsupervised Representation for EHR Signals and Codes as Patient Status Vector, 2019.

Page 48: Lecture 15: Course Conclusion


Variational autoencoders can also be used to sample new (synthetic) data

Use the decoder network, but now sample z from the prior:
- Sample z from the prior $p(z)$
- Sample x|z from the decoder $p_\theta(x \mid z)$

For a 2-d z, varying $z_1$ and $z_2$ traces out the learned data manifold.

Kingma and Welling. “Auto-Encoding Variational Bayes”. ICLR 2014.
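A minimal sketch of sampling from a trained decoder (the decoder here is an untrained stand-in with illustrative sizes, and a standard normal prior is assumed):

```python
import torch
import torch.nn as nn

latent_dim = 2
decoder = nn.Sequential(                # stands in for a trained decoder network
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, 784), nn.Sigmoid(),  # e.g., pixel probabilities for a 28x28 image
)

z = torch.randn(16, latent_dim)         # sample z from the prior N(0, I)
x = decoder(z)                          # decode x|z into new (synthetic) samples
print(x.shape)                          # torch.Size([16, 784])
```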

Page 49: Lecture 15: Course Conclusion


GANs: Two-player game
- Generator network: tries to fool the discriminator by generating real-looking images from random noise z
- Discriminator network: tries to distinguish between real images (from the training set) and fake images (from the generator)

Ian Goodfellow et al. “Generative Adversarial Nets”, NIPS 2014.
Fake and real images copyright Emily Denton et al. 2015. Reproduced with permission.
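One alternating training step, as a minimal sketch (MLPs and random tensors stand in for real networks and images; sizes are illustrative):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, 784)  # stands in for a batch of real images
z = torch.randn(32, 16)      # random noise input to the generator

# Discriminator step: real images labeled 1, generated images labeled 0.
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(G(z).detach()), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make the discriminator say "real" for fakes.
g_loss = bce(D(G(z)), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```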

Page 50: Lecture 15: Course Conclusion


Example: GAN-based medical image synthesis
- Liver lesions of different types (Frid-Adar 2018)
- Dermatology lesions (Ghorbani 2019)
- Brain MRIs with lesions (Han 2018)

Can be used for data augmentation!

Page 51: Lecture 15: Course Conclusion


A third paradigm of learning: reinforcement learning

Problems involving an agent interacting with an environment, which provides numeric reward signals.

Goal: learn how to take actions in order to maximize reward.

Atari games figure copyright Volodymyr Mnih et al., 2013. Reproduced with permission.

Page 52: Lecture 15: Course Conclusion


Q-network architecture

$Q(s, a; \theta)$: neural network with weights $\theta$

Current state s_t: 84x84x4 stack of the last 4 frames (after RGB -> grayscale conversion, downsampling, and cropping)

Architecture: 16 8x8 conv filters with stride 4 -> 32 4x4 conv filters with stride 2 -> FC-256 -> FC-4 (Q-values). The output is the expected future reward from taking each of the 4 possible actions.

[Mnih et al. NIPS Workshop 2013; Nature 2015]
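That architecture as a PyTorch sketch (the flattened size 32*9*9 follows from the conv arithmetic on an 84x84 input):

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 16 8x8 filters, stride 4 -> 20x20
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 32 4x4 filters, stride 2 -> 9x9
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 9 * 9, 256),                  # FC-256
    nn.ReLU(),
    nn.Linear(256, 4),                           # FC-4: one Q-value per action
)
state = torch.randn(1, 4, 84, 84)                # stack of the last 4 grayscale frames
print(q_net(state).shape)                        # torch.Size([1, 4])
```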

Page 53: Lecture 15: Course Conclusion


Example: Raghu et al. 2017

Learned a Q-learning-based policy to take treatment actions for sepsis patients, using the MIMIC dataset

5x5 possible policy actions at any timestep

Raghu et al. Deep Reinforcement Learning for Sepsis Treatment, 2017.

Page 54: Lecture 15: Course Conclusion


Interpretability: a challenge in deep learning

Figure: a decision tree (interpretable by inspection) vs. a deep neural network
(https://www.cs.cmu.edu/~bhiksha/courses/10-601/decisiontrees/DT.png)

Page 55: Lecture 15: Course Conclusion


Saliency maps: Class Activation Maps (CAM)
- Zhou et al. 2016
- Visualizes a heatmap (class activation map) indicating the importance of the activation at spatial grid location (x, y) for the classification of an image to class c:

$M_c(x, y) = \sum_k w_k^c\, f_k(x, y)$

where $w_k^c$ is the weight (importance) of the k-th filter activation $f_k$ for predicting the c-th class.

Zhou et al. Learning Deep Features for Discriminative Localization, 2016.
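A minimal sketch of computing the map from the final conv activations and the class weights (shapes are illustrative; this assumes the global-average-pooling plus single-linear-layer setup that CAM requires):

```python
import torch

K, H, W, C = 512, 7, 7, 1000
f = torch.randn(K, H, W)                  # final conv activations f_k(x, y)
w = torch.randn(C, K)                     # linear-layer weights w_k^c
c = 42                                    # class of interest

cam = torch.einsum("k,khw->hw", w[c], f)  # M_c(x, y) = sum_k w_k^c * f_k(x, y)
print(cam.shape)                          # torch.Size([7, 7]); upsample to image size to overlay
```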

Page 56: Lecture 15: Course Conclusion


Rajpurkar et al. 2017
- Binary classification of pneumonia presence in chest X-rays
- Used the ChestX-ray14 dataset, with over 100,000 frontal X-ray images spanning 14 diseases
- 121-layer DenseNet CNN
- Compared algorithm performance with 4 radiologists
- Also applied the algorithm to other diseases, surpassing the previous state of the art on ChestX-ray14
- CAM visualization

Rajpurkar et al. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. 2017.

Page 57: Lecture 15: Course Conclusion


Ethics: many questions around AI / human collaboration in medicine

- How to make diagnosis and/or care decisions when the algorithm disagrees with the human?
- How should AI algorithms work together with humans?
- How to handle machine error vs. human error?
- How to make sure AI algorithms don’t (perhaps inadvertently) discriminate against certain populations?
- How to handle tradeoffs between algorithmic performance on some groups vs. others?

Page 58: Lecture 15: Course Conclusion


Chen et al. 2019
- Showed discrepancies in error rates by race, gender, insurance type, etc. for models trained to make clinical predictions on MIMIC-III data
- Example: error rate for predicting ICU mortality, by gender

Chen et al. Can AI Help Reduce Disparities in General Medical and Mental Health Care? 2019.

Page 59: Lecture 15: Course Conclusion


More on fairness… there are many possible definitions of fairness!

- Group-independent predictions: predictions should be independent of group membership

- Equal metrics across groups: e.g. equal true positive rates or false positive rates across groups

- Individual fairness: individuals who are similar with respect to a prediction task should have similar outcomes

- Causal fairness: e.g. there should not be a causal pathway from a sensitive attribute to the outcome prediction

Suresh and Guttag. A Framework for Understanding Unintended Consequences of Machine Learning, 2020.

Cannot satisfy all of these simultaneously: satisfying “fairness” according to one definition generally leads to a trade-off with respect to another definition!

Page 60: Lecture 15: Course Conclusion


Mitchell 2019: Model Cards for Model Reporting
- Documentation accompanying trained models to detail performance characteristics

Mitchell et al. Model Cards for Model Reporting, 2019.

Page 61: Lecture 15: Course Conclusion


Gebru 2020: Datasheets for Datasets

Gebru et al. Datasheets for Datasets. 2020.

Page 62: Lecture 15: Course Conclusion


Federated Learning
- Related to distributed computing, but with an important property for many medical settings: data is decentralized and never leaves local silos. A central server controls training across the decentralized sources.

Figure credit: https://blogs.nvidia.com/wp-content/uploads/2019/10/federated_learning_animation_still_white.png
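A minimal FedAvg-style sketch of that control loop (three toy "hospital" silos; the simple weight averaging here is for illustration, not a production federated system):

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def local_update(global_model, data, targets):
    model = copy.deepcopy(global_model)  # each silo trains on its own data, locally
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss = F.mse_loss(model(data), targets)
    opt.zero_grad(); loss.backward(); opt.step()
    return model.state_dict()            # only the weights leave the silo, never the data

global_model = nn.Linear(5, 1)
silos = [(torch.randn(8, 5), torch.randn(8, 1)) for _ in range(3)]  # 3 hospitals' private data

for rnd in range(10):
    local_weights = [local_update(global_model, x, y) for x, y in silos]
    avg = {k: torch.stack([w[k] for w in local_weights]).mean(0) for k in local_weights[0]}
    global_model.load_state_dict(avg)    # the central server aggregates into the global model
```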

Page 63: Lecture 15: Course Conclusion


Li et al. 2019
- NVIDIA Clara’s federated learning system for medical imaging data
- Used federated learning to train a segmentation model on BraTS
- Achieved performance comparable to non-federated learning; training was somewhat slower, but the data “silos” were preserved

Li et al. Privacy-preserving Federated Brain Tumour Segmentation, 2019.

Page 64: Lecture 15: Course Conclusion


Differential privacy

Key idea: the output computed on a dataset, vs. on the same dataset differing in a single entry (e.g., one individual), is “hardly different”. Differential privacy gives mathematical guarantees on this idea.

Abadi et al. Deep Learning with Differential Privacy, 2016.

Page 65: Lecture 15: Course Conclusion


Differential privacy

Simple intuition behind how we can achieve differential privacy: adding noise!

Figure credit: https://github.com/frankmcsherry/blog/blob/master/posts/2016-02-03.md

Example of reporting a value with Laplacian noise added
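A minimal sketch of that Laplace mechanism (the count, sensitivity, and epsilon values are illustrative):

```python
import numpy as np

rng = np.random.default_rng()
true_count = 47    # e.g., number of patients with some condition
sensitivity = 1.0  # adding/removing one individual changes the count by at most 1
epsilon = 0.5      # privacy budget; smaller epsilon means more noise

# Laplace mechanism: noise scale = sensitivity / epsilon.
noisy_count = true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)
print(noisy_count)  # the reported value is "hardly different" in distribution across neighbors
```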

Page 66: Lecture 15: Course Conclusion


Training differentially private deep learning models

Abadi et al. Deep Learning with Differential Privacy, 2016.

Add noise for differential privacy
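A minimal sketch of the DP-SGD idea from Abadi et al. (clip each per-example gradient, then add Gaussian noise before the update); the hyperparameters are illustrative, and the privacy accounting that turns this into an epsilon guarantee is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(5, 1)
C, noise_multiplier, lr = 1.0, 1.1, 0.1
x, y = torch.randn(8, 5), torch.randn(8, 1)

grads = [torch.zeros_like(p) for p in model.parameters()]
for i in range(len(x)):                   # compute per-example gradients
    model.zero_grad()
    F.mse_loss(model(x[i:i+1]), y[i:i+1]).backward()
    norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
    scale = (C / (norm + 1e-12)).clamp(max=1.0)   # clip each gradient to norm at most C
    for g, p in zip(grads, model.parameters()):
        g += p.grad * scale

with torch.no_grad():
    for g, p in zip(grads, model.parameters()):
        noisy = (g + torch.randn_like(g) * noise_multiplier * C) / len(x)  # add noise, average
        p -= lr * noisy                   # noisy gradient step
```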

Page 67: Lecture 15: Course Conclusion


Can work with differential privacy within deep learning frameworks, e.g., TensorFlow Privacy:
- Implementation of DP-SGD
- Utilities for calculating epsilon

https://blog.tensorflow.org/2019/03/introducing-tensorflow-privacy-learning.html
http://www.cleverhans.io/privacy/2019/03/26/machine-learning-with-differential-privacy-in-tensorflow.html

Page 68: Lecture 15: Course Conclusion


Where to go from here?
- More deep learning courses, e.g., focusing on different domains
- CS 221 and CS 229: broader AI courses
- CS 231N: computer vision
- CS 224N: natural language processing
- CS 224S: spoken language processing
- CS 236: generative models
- Many more!: https://ai.stanford.edu/courses/
- More biomedicine-focused courses
- CS/BMI 273B: deep learning in genomics
- CS/BMI 279: computational biology
- BMI 217: translational bioinformatics
- Many more! (BMI courses): https://explorecourses.stanford.edu/search?view=catalog&filter-coursestatus-Active=on&page=0&catalog=&academicYear=&q=BIOMEDIN&collapse=
- (BIODS courses): https://explorecourses.stanford.edu/search?q=BIODS&view=catalog&academicYear=&catalog=&page=0&filter-coursestatus-Active=on&collapse=
- Many research and internship opportunities as well


Page 70: Lecture 15: Course Conclusion


Thank you!