Machine Learning for Programs
Martin Vechev, Pavol Bielik
Department of Computer Science, ETH Zurich


Page 1: Machine Learning for Programs

Martin Vechev Pavol Bielik

Department of Computer Science

ETH Zurich

Machine Learning for Programs

Page 2:

Learning and probabilistic models based on Big Data have revolutionized entire fields

Natural Language Processing (e.g., machine translation)

Computer Vision (e.g., image captioning: “A group of people shopping at an outdoor market. There are many vegetables at the fruit stand.”)

Medical Computing (e.g., disease prediction)

Page 3:

Can we bring this revolution to programmers?


Page 4:

Vision: new kinds of techniques and tools that learn from Big Code

Statistical Programming Tool: given a programming task, a probabilistic model produces a solution.

15 million repositories; billions of lines of code; high quality, tested, maintained programs. (Chart: number of repositories over the last 8 years.)

Page 5:

Probabilistic Learning from Big Code

Martin Vechev, Veselin Raychev, Pavol Bielik, Christine Zeller, Svetoslav Karaivanov, Pascal Roos, Benjamin Bichsel, Andreas Krause, Timon Gehr, Petar Tsankov

More information: http://www.srl.inf.ethz.ch/

Probabilistically likely solutions to problems impossible to solve otherwise

Publications

PHOG: Probabilistic Model for Code, ICML'16
Learning Programs from Noisy Data, ACM POPL'16
Predicting Program Properties from “Big Code”, ACM POPL'15
Code Completion with Statistical Language Models, ACM PLDI'14
Machine Translation for Programming Languages, ACM Onward!'14

Publicly Available Tools

http://JSNice.org (statistical de-obfuscation)
http://Nice2Predict.org (structured prediction framework)

Page 6:

Tutorial Outline

• Motivation
• Applications
• Statistical Language Models
  • N-gram, Recurrent Networks, PCFGs
  • Application: code completion
• Graphical Models
  • Markov Networks, Conditional Random Fields
  • Inference & Learning in Markov Networks
  • Application: predicting names and types
• Learning Features for Programs
• Combining Program Synthesis + Machine Learning

Schedule: 45 min, 15 min break, 60 min, 15 min break, 45 min

Page 7:

Statistical Programming Tools

7

Write new code [PLDI’14]:

Code Completion

Camera camera = Camera.open();

camera.setDisplayOrientation(90);

?

Page 8:

Statistical Programming Tools

8


Port code [ONWARD’14]:

Programming Language Translation

Page 9:

Statistical Programming Tools

9

Debug code: Statistical Bug Detection

...
for x in range(a):
    print a[x]   # likely error

Page 10:

Statistical Programming Tools

10

Understand code/security [POPL’15]:

JavaScript Deobfuscation

Type Prediction


www.jsnice.org

Page 11:

JSNice.org

Used in every country; top-ranked tool; 150,000+ users

[Predicting program properties from Big Code, ACM POPL 2015]

Page 12:

Statistical Programming Tools

12

All of these tools benefit from probabilistic models of code.

Page 13:


Tutorial Outline

13

Page 14:

Machine Learning for Programs

Pipeline: Analyze Program (PL) → Intermediate Representation → Train Model (ML) → Query Model (ML) → Applications

14

Page 15:

Machine Learning for Programs

Same pipeline; example applications: code completion, deobfuscation, program synthesis, feedback generation, translation.

15

Page 16:

Machine Learning for Programs

Same pipeline; example intermediate representations: sequences (sentences), trees, translation tables, feature vectors, graphical models (CRFs).

16

Page 17:

Machine Learning for Programs

Same pipeline; example program analyses: alias analysis, typestate analysis, control-flow analysis, scope analysis.

17

Page 18:

Machine Learning for Programs

Same pipeline; example models: n-gram language models, SVMs, structured SVMs, neural networks.

18

Page 19:

Machine Learning for Programs

Same pipeline; example queries: argmax_{y ∈ Ω} P( y | x ), answered greedily or via MAP inference.

19


Page 21:

Motivation: Working with APIs

21

Page 22:

Camera camera = Camera.open();
camera.setDisplayOrientation(90);
camera.unlock();
MediaRecorder rec = new MediaRecorder();
rec.setCamera(camera);
rec.setAudioSource(MediaRecorder.AudioSource.MIC);
rec.setVideoSource(MediaRecorder.VideoSource.DEFAULT);
rec.setOutputFormat(MediaRecorder.OutputFormat.MPEG_4);
rec.setAudioEncoder(1);
rec.setVideoEncoder(3);
rec.setOutputFile("file.mp4");
...
?
?
?

Statistical Code Synthesis: Capabilities
[Code Completion with Statistical Language Models, V. Raychev, M. Vechev, E. Yahav, PLDI'14]

22


Page 24:

Handles multiple objects

24

Page 25:

Infers multiple statements

25

Page 26:

Infers sequences not in training data

26

Page 27:

Regularities in code are similar to regularities in natural language

Key Insight

27

Page 28:

We want to learn that MediaRecorder rec = new MediaRecorder(); comes before rec.setCamera(camera);

28

Page 29:

Just as in natural language, “Hello” comes before “World !”.

29

Page 30:

From Programs to Sentences

she = new X();
me = new Y();
me.sleep();
if (random()) { me.eat(); }
she.enter();
me.talk(she);

30


Page 37:

From Programs to Sentences

Typestate analysis + alias analysis extract, for abstract object me:

Yinit sleep talk
Yinit sleep eat talk

and for abstract object she:

Xinit enter talkparam1

37

Page 38:

Training phase: Program Analysis → sentences

From Programs to Sentences

38

Page 39:

Learn Regularities

Learn regularities in obtained sentences

Regularities in sentences ⇔ regularities in API usage

If we see many sequences like: Xinit enter talkparam1, then we should learn that talkparam1 often comes after enter.

39
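The extraction step above can be sketched in code. The following is a toy approximation, assuming we simply track method calls per receiver variable name (a stand-in for the real alias and typestate analyses on an abstract object); the helper name `call_sentences` and the example snippet are hypothetical:

```python
import ast

def call_sentences(source):
    """Per receiver variable, collect the sequence of method names called
    on it: a toy stand-in for the alias and typestate analyses."""
    sentences = {}
    for node in ast.walk(ast.parse(source)):
        # match calls of the form <name>.<method>(...)
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)):
            recv = node.func.value.id
            sentences.setdefault(recv, []).append(node.func.attr)
    return sentences

src = """
me = Y()
me.sleep()
me.eat()
she = X()
she.enter()
me.talk(she)
"""
print(call_sentences(src))  # {'me': ['sleep', 'eat', 'talk'], 'she': ['enter']}
```

Unlike the analyses on the slides, this sketch ignores aliasing and branching entirely; it only illustrates what an API-call "sentence" looks like.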

Page 40:

Statistical Language Models

Given a sentence s = w1 w2 w3 … wn

estimate P( w1 w2 w3 … wn )

Queries:

P(“Hello World!”) ≫ P(“World! Hello”)

argmax_y P( “Hello” y )

40

Page 41:

Statistical Language Models


Decomposed to conditional probabilities

P( w1 w2 w3 … wn ) = ∏i=1..n P( wi | w1 … wi-1 )

41

Page 42:

Statistical Language Models

P( The quick brown fox jumped ) =
P( The ) · P( quick | The ) · P( brown | The quick ) · P( fox | The quick brown ) · P( jumped | The quick brown fox )

42

Page 43:

N-gram language model

Conditional probability only on previous n-1 words

P( wi | w1 … wi-1 ) ≈ P( wi | wi-n+1 … wi-1 )

43


Page 45:

N-gram language model

P( wi | w1 … wi-1 ) ≈ P( wi | wi-n+1 … wi-1 ) ≈ #( wi-n+1 … wi-1 wi ) / #( wi-n+1 … wi-1 )

45

Page 46:

N-gram language model

Training amounts to counting n-grams; #( n-gram ) is the number of occurrences of the n-gram in the training data. The time spent per word encountered in training is constant, so training is usually fast.

E.g., with a 3-gram language model:

P( jumped | The quick brown fox ) ≈ P( jumped | brown fox ) ≈ #( brown fox jumped ) / #( brown fox )

46

Page 47:

Tri-gram language model

3-gram               # of occurrences
brown fox jumped     125
brown fox walked      45
brown fox snapped     30

P( jumped | brown fox ) ≈ 125 / 200 = 0.625

P( w1 w2 w3 … wn ) ≈ P( w1 ) · P( w2 | w1 ) · P( w3 | w1 w2 ) · … · P( wn | wn-2 wn-1 )

47
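Training by counting, as described above, fits in a few lines. A minimal sketch (the function names are illustrative, not from the PLDI'14 system), reproducing the slide's counts:

```python
from collections import Counter

def train_trigram(corpus):
    """Train by counting: tally every 3-gram and every 2-gram context."""
    tri, bi = Counter(), Counter()
    for tokens in corpus:
        for i in range(len(tokens) - 2):
            tri[tuple(tokens[i:i + 3])] += 1
        for i in range(len(tokens) - 1):
            bi[tuple(tokens[i:i + 2])] += 1
    return tri, bi

def p(word, context, tri, bi):
    """P( word | context ) = #( context word ) / #( context )."""
    return tri[context + (word,)] / bi[context] if bi[context] else 0.0

corpus = [["brown", "fox", "jumped"]] * 125 \
       + [["brown", "fox", "walked"]] * 45 \
       + [["brown", "fox", "snapped"]] * 30
tri, bi = train_trigram(corpus)
print(p("jumped", ("brown", "fox"), tri, bi))  # 0.625
```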

Page 48:

Tri-gram language model

P( brown fox jumped ) ≈ P( brown ) · P( fox | brown ) · P( jumped | brown fox )
≈ ( 200 / 600 ) · ( 200 / 200 ) · ( 125 / 200 ) ≈ 0.208

48

Page 49:

Key Problem: Sparsity of Data

P( jumped | red fox ) ≈ #( red fox jumped ) / #( red fox )

What if the count #( red fox jumped ) is 0?

The sparsity problem gets worse as the size of the n-gram becomes larger: we need to handle n-grams with zero or few occurrences in the training data. Techniques that address this include smoothing and discounting.

49

Page 50:

Solution: Smoothing Techniques

• Smoothing techniques give non-zero probability to n-grams not in the training data.
• Essentially, they estimate how likely it is that an n-gram is missing due to the limited size of the training data.
• They work by taking probability mass from the existing n-grams and redistributing it over n-grams that occur zero times.
• Typically, the probability of such n-grams is estimated by looking at the probabilities of (n-1)-grams, (n-2)-grams, unigrams, etc.

50

Page 51:

Smoothing: Intuitively

Before smoothing, all probability mass sits on the training data; all other sentences get 0 probability. After smoothing, the mass is redistributed so that all other sentences get non-zero probability.

51


Page 53:

Smoothing Techniques

Backoff techniques:

P_smooth( wi | wi-n+1 … wi-1 ) =
    P_ML( wi | wi-n+1 … wi-1 )                               if wi-n+1 … wi occurs in the training data
    γ( wi-n+1 … wi-1 ) · P_smooth( wi | wi-n+2 … wi-1 )      otherwise

Example smoothing techniques are: Witten-Bell Interpolated (WBI), Witten-Bell Backoff (WBB), Natural Discounting (ND), Stupid-Backoff (SB), …

53

Page 54:

Smoothing Techniques

Interpolation techniques:

P_smooth( wi | wi-n+1 … wi-1 ) = λ · P_ML( wi | wi-n+1 … wi-1 ) + ( 1 − λ ) · P_smooth( wi | wi-n+2 … wi-1 )

Example smoothing techniques are: Witten-Bell Interpolated (WBI), Witten-Bell Backoff (WBB), Natural Discounting (ND), Stupid-Backoff (SB), …

54
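As a concrete instance of the backoff scheme above, here is a minimal sketch of Stupid Backoff, which replaces γ with a fixed penalty α = 0.4 and returns unnormalized scores rather than probabilities. The toy counts below are hypothetical:

```python
def score(word, context, counts, alpha=0.4):
    """Stupid Backoff: relative frequency if the n-gram was seen,
    otherwise back off to a shorter context with a fixed penalty alpha.
    Returns a score, not a normalized probability."""
    if not context:  # unigram base case
        total = sum(c for g, c in counts.items() if len(g) == 1)
        return counts.get((word,), 0) / total if total else 0.0
    ngram = context + (word,)
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[context]
    return alpha * score(word, context[1:], counts, alpha)

# hypothetical counts: "red fox jumped" was never seen in training
counts = {
    ("brown", "fox"): 200, ("brown", "fox", "jumped"): 125,
    ("fox",): 200, ("fox", "jumped"): 125, ("jumped",): 125,
    ("red", "fox"): 5,
}
print(score("jumped", ("red", "fox"), counts))  # 0.4 * (125/200) = 0.25
```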

Page 55:

Summary: N-gram Language Model

P( wi | w1 … wi-1 ) ≈ P( wi | wi-n+1 … wi-1 ) ≈ #( wi-n+1 … wi-1 wi ) / #( wi-n+1 … wi-1 )

+ Easy to implement
+ Fast to train (simply counting)

- Data sparsity issues
- Can be tricky to capture long-distance relationships

An existing library implementing language models is SRILM: http://www.speech.sri.com/projects/srilm/

55

Page 56:

Learn Regularities

Training phase: Program Analysis → sentences → Train Language Model → language model (N-gram)

56

Page 57:

Recurrent Neural Networks (RNNs)

57

Diagram: the input word (e.g., “The”) is a one-hot vector feeding a hidden layer (state i), which feeds an output layer; a softmax classifier over the output gives the probability of each possible next word (e.g., of “The”, of “brown”).



Page 60:

Recurrent Neural Networks (RNNs)

60

The matrices U, V and W are learned through backpropagation through time, which unfolds the network over the input sequence.

Page 61:

Recurrent Neural Networks (RNNs)

61

The matrix U learns a low-dimensional continuous vector representation of the input words, also referred to as a “word embedding”.

Page 62:

Word Embeddings

[Linguistic Regularities in Continuous Space Word Representations, Mikolov et al., NAACL-HLT 2013]

62

Page 63:

Summary: RNNs

+ RNNs can learn dependencies beyond the prior several words (LSTM networks)
+ Active research area
+ Learn a continuous representation of the input vocabulary

- Slow to train
- Difficult to train
- Not interpretable

Download: RNNLM (http://rnnlm.org/), TensorFlow (https://www.tensorflow.org/)

63
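For intuition, the forward pass of the RNN language model sketched above can be written with numpy; the sizes and random parameters are toy stand-ins, and training via backpropagation through time is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN = 5, 8  # toy sizes

# Parameters, named as on the slides: U embeds the one-hot input word,
# W carries the previous hidden state, V maps the state to output scores.
U = rng.normal(0, 0.1, (HIDDEN, VOCAB))
W = rng.normal(0, 0.1, (HIDDEN, HIDDEN))
V = rng.normal(0, 0.1, (VOCAB, HIDDEN))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(word_id, state):
    """One RNN step: consume a word id, return (P(next word), new state)."""
    x = np.zeros(VOCAB)
    x[word_id] = 1.0                    # one-hot input word
    state = np.tanh(U @ x + W @ state)  # new hidden state
    return softmax(V @ state), state

state = np.zeros(HIDDEN)
for w in [0, 1]:  # feed two words, e.g. "The", "quick"
    probs, state = step(w, state)
print(round(float(probs.sum()), 6))  # 1.0 -- a proper distribution
```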

Page 64:

Training phase: Program Analysis → sentences → Train Language Model (Learn Regularities: N-gram + RNN) → Language model

64
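A deliberately simplified stand-in for the "Program Analysis" box can show the shape of its output. SLANG itself uses real static analysis; the regex below is purely illustrative and the function name is made up:

```python
import re

# Turn a method body into "sentences" of API calls, grouped per receiver
# object -- the kind of training data the language model consumes.
CALL = re.compile(r"(\w+)\.(\w+)\(")

def api_sentences(code):
    sentences = {}
    for receiver, method in CALL.findall(code):
        sentences.setdefault(receiver, []).append(method)
    return sentences

code = """
camera.setDisplayOrientation(90);
camera.unlock();
rec.setCamera(camera);
rec.setAudioSource(MediaRecorder.AudioSource.MIC);
"""
print(api_sentences(code))
# → {'camera': ['setDisplayOrientation', 'unlock'], 'rec': ['setCamera', 'setAudioSource']}
```

Each per-object call sequence is then treated as a sentence when training the n-gram and RNN models.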

Page 65:

incomplete program:

Camera camera = Camera.open();
camera.setDisplayOrientation(90);
camera.unlock();
MediaRecorder rec = new MediaRecorder();
rec.setCamera(camera);
rec.setAudioSource(MediaRecorder.AudioSource.MIC);
rec.setVideoSource(MediaRecorder.VideoSource.DEFAULT);
rec.setOutputFormat(MediaRecorder.OutputFormat.MPEG_4);
rec.setAudioEncoder(1);
rec.setVideoEncoder(3);
rec.setOutputFile("file.mp4");
?
?
?
...

completed program: the same code with the three holes filled in by the model.

Training phase: Program Analysis → sentences → Train Language Model (N-gram + RNN) → Language model

The SLANG System

65

Page 66:

Statistical Code Synthesis

smsMgr = SmsManager.getDefault();
int length = message.length();
if (length > MAX_SMS_MESSAGE_LENGTH) {
    list = smsMgr.divideMessage(message);
    ? {smsMgr, list}      // (Hole H1)
} else {
    ? {smsMgr, message}   // (Hole H2)
}

66

Page 67:

Statistical Code Synthesis

smsMgr = SmsManager.getDefault();
int length = message.length();
if (length > MAX_SMS_MESSAGE_LENGTH) {
    list = smsMgr.divideMessage(message);
    ? {smsMgr, list}      // (Hole H1)
} else {
    ? {smsMgr, message}   // (Hole H2)
}

Abstract object smsMgr:

getDefault_result divideMessage H1
getDefault_result H2

67


Page 69:

Statistical Code Synthesis

smsMgr = SmsManager.getDefault();
int length = message.length();
if (length > MAX_SMS_MESSAGE_LENGTH) {
    list = smsMgr.divideMessage(message);
    ? {smsMgr, list}      // (Hole H1)
} else {
    ? {smsMgr, message}   // (Hole H2)
}

Abstract object list:

divideMessage_result H1

Abstract object smsMgr:

getDefault_result divideMessage H1
getDefault_result H2

69


Page 71:

Statistical Code Synthesis

smsMgr = SmsManager.getDefault();
int length = message.length();
if (length > MAX_SMS_MESSAGE_LENGTH) {
    list = smsMgr.divideMessage(message);
    ? {smsMgr, list}      // (Hole H1)
} else {
    ? {smsMgr, message}   // (Hole H2)
}

Abstract object list:

divideMessage_result H1

Abstract object smsMgr:

getDefault_result divideMessage H1
getDefault_result H2

Abstract object message:
length divideMessage_param1
length H2

71


Page 73:

Statistical Code Synthesis

argmax_H1 P(getDefault_result divideMessage H1)

getDefault_result divideMessage H1

Abstract object smsMgr:

Page 74:

Statistical Code Synthesis

argmax_H1 P(getDefault_result divideMessage H1)

Completion                                                  probability
getDefault_result divideMessage sendMultipartTextMessage    0.0033

Abstract object smsMgr:

Page 75:

Statistical Code Synthesis

Completion                                                  probability
getDefault_result divideMessage sendMultipartTextMessage    0.0033
getDefault_result divideMessage sendTextMessage             0.0016

Abstract object smsMgr:

Page 76:

Completion                                                  probability
getDefault_result divideMessage sendMultipartTextMessage    0.0033
getDefault_result divideMessage sendTextMessage             0.0016
getDefault_result sendTextMessage                           0.0073
getDefault_result sendMultipartTextMessage                  0.0010
divideMessage_result sendMultipartTextMessage_param3        0.0821
length length                                               0.0132
length split                                                0.0080
length sendTextMessage_param3                               0.0017

Statistical Code Synthesis

Abstract object smsMgr:


Page 78:

Completion                                                  probability
getDefault_result divideMessage sendMultipartTextMessage    0.0033
getDefault_result divideMessage sendTextMessage             0.0016
getDefault_result sendTextMessage                           0.0073
getDefault_result sendMultipartTextMessage                  0.0010
divideMessage_result sendMultipartTextMessage_param3        0.0821
length length                                               0.0132
length split                                                0.0080
length sendTextMessage_param3                               0.0017

Not a feasible solution: completions disagree on the selected method. The solution must satisfy program constraints.

Statistical Code Synthesis

Abstract object smsMgr:
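The feasibility requirement can be made concrete with a tiny sketch. This is not the actual SLANG inference; the data structures are made up for illustration, parameter subscripts are dropped, and the numbers come from the completion tables above. A hole's candidate is kept only if every sentence mentioning that hole agrees on the method, and the winner maximizes the product of the sentence probabilities:

```python
# Per-hole candidate probabilities, one dict per "sentence" mentioning the hole.
scores = {
    "H1": [
        {"sendMultipartTextMessage": 0.0033, "sendTextMessage": 0.0016},  # smsMgr
        {"sendMultipartTextMessage": 0.0821},                             # list
    ],
    "H2": [
        {"sendTextMessage": 0.0073, "sendMultipartTextMessage": 0.0010},  # smsMgr
        {"sendTextMessage": 0.0017, "length": 0.0132, "split": 0.0080},   # message
    ],
}

def best_completion(sentences):
    # Feasible candidates: methods every sentence can complete with.
    common = set(sentences[0])
    for s in sentences[1:]:
        common &= set(s)
    # Score of a feasible method = product of its probabilities.
    def joint(method):
        p = 1.0
        for s in sentences:
            p *= s[method]
        return p
    return max(common, key=joint)

completion = {hole: best_completion(s) for hole, s in scores.items()}
print(completion)
# → {'H1': 'sendMultipartTextMessage', 'H2': 'sendTextMessage'}
```

Note how `length` and `split` score well for the message object alone, yet drop out because the smsMgr sentence for H2 cannot agree with them.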



Page 81:

smsMgr = SmsManager.getDefault();
int length = message.length();
if (length > MAX_SMS_MESSAGE_LENGTH) {
    list = smsMgr.divideMessage(message);
    smsMgr.sendMultipartTextMessage(...list...);
} else {
    smsMgr.sendTextMessage(...message...);
}

Statistical Code Synthesis

81

Page 82:

incomplete program:

Camera camera = Camera.open();
camera.setDisplayOrientation(90);
camera.unlock();
MediaRecorder rec = new MediaRecorder();
rec.setCamera(camera);
rec.setAudioSource(MediaRecorder.AudioSource.MIC);
rec.setVideoSource(MediaRecorder.VideoSource.DEFAULT);
rec.setOutputFormat(MediaRecorder.OutputFormat.MPEG_4);
rec.setAudioEncoder(1);
rec.setVideoEncoder(3);
rec.setOutputFile("file.mp4");
?
?
?
...

Completion phase: incomplete program → Program Analysis → sentences with holes → Combine / Query → completed sentences → completed program

Training phase: Program Analysis → sentences → Train Language Model → Language model

Trained on 1M Java methods (~700MB of code, ~100MB of sentences). Correct completion in top 3 for ~90% of the 84 testing samples.

The SLANG System

82

Page 83:

• Connection to PL
  • decide what to "compile" into the words
  • Program Analysis need not be sound
• N-gram Models
  • Easy and cheap to train
  • Important to deal with sparsity via smoothing
  • Can be tricky to capture long-distance relationships
• Recurrent Networks
  • Can capture longer-distance relationships
  • More expensive to train
  • Experimentally similar to n-gram language models

Lessons

83
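The smoothing lesson is easy to see in code. Below is a minimal bigram model with add-k smoothing over API-call "sentences" (illustrative only; SLANG combines 3-gram and RNN models, and the training data here is invented):

```python
from collections import Counter

def train_bigram(sentences):
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for s in sentences:
        toks = ["<s>"] + s
        vocab.update(toks)
        for prev, cur in zip(toks, toks[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
    return bigrams, unigrams, vocab

def prob(model, prev, cur, k=0.1):
    bigrams, unigrams, vocab = model
    # add-k smoothing: an unseen bigram still gets non-zero probability
    return (bigrams[(prev, cur)] + k) / (unigrams[prev] + k * len(vocab))

model = train_bigram([
    ["getDefault", "divideMessage", "sendMultipartTextMessage"],
    ["getDefault", "sendTextMessage"],
    ["getDefault", "divideMessage", "sendMultipartTextMessage"],
])
p_seen = prob(model, "getDefault", "divideMessage")       # observed bigram
p_unseen = prob(model, "divideMessage", "sendTextMessage")  # never observed
```

Without the `k` terms, `p_unseen` would be exactly 0 and a single unseen API pair would zero out the probability of an entire candidate completion; smoothing is what keeps sparse code corpora usable.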

Page 84:

[Plot: precision vs. % of data used (1%, 10%, 100%; precision axis from 0 to 1), with and without alias analysis.]

Static Analysis Benefit = 10x more Data

Importance of Static Analysis

Page 85:

Models of Full Language

• So far:
  – N-gram and RNNs
  – Modelling specific elements (APIs)

• How can we build statistical language models of the full programming language?

85

Page 86:

• Motivation
  • Potential applications
• Statistical Language Models
  • N-gram, Recurrent Networks, PCFGs
  • Application: code completion
• Graphical Models
  • Markov Networks, Conditional Random Fields
  • Inference & Learning in Markov Networks
  • Application: predicting names and types
• Learning Features for Programs
  • Combining Program Synthesis + Machine Learning

Tutorial Outline

86

Page 87:

Applications

Natural Language Processing

[Parse tree of "the dog saw a man in the park": S → NP VP, with NP → Det N, VP → V NP PP, PP → P NP]

Computer Vision

Programming

fun chunkData(e, t)
  var n = [];
  var r = e.length;
  var i = 0;
  for (; i < r; i += t)
  return n;

87

Page 88:

Facts to be predicted are dependent

Millions of candidate choices

Must quickly learn from huge codebases

Prediction should be fast (real time)

Challenges

88

Page 89:

Phrase the problem of predicting program facts as

Structured Prediction for Programs

Key Idea

89

Page 90:

Structured Prediction for Programs [V. Raychev, M. Vechev, A. Krause, ACM POPL’15]

Program Analysis + Graphical Models (Inference, Learning) = Knowledge + Reasoning

First connection between Programs and Conditional Random Fields

Page 91:

Machine Learning for Programs — the pipeline:

Analyze Program (PL) → Intermediate Representation → Train Model (ML) → Query Model (ML) → Applications

Analyze Program (PL): alias analysis, typestate analysis, control-flow analysis, scope analysis
Intermediate Representation: Sequences (sentences), Trees, Graphical Models (CRFs)
Train Model (ML): N-gram language model, SVM, Structured SVM, Neural Networks, Translation Table, Log-linear Models, Markov Networks
Query Model (ML): argmax_{y ∈ Ω} P(y | x), Greedy MAP inference
Applications: Code completion, Deobfuscation, Program synthesis, Feedback generation, Translation

More information and tutorials at: http://www.nice2predict.org/ and http://www.srl.inf.ethz.ch/



Page 94:

Defining a Global Probabilistic Model

Joint probability distribution

P(A, B, C, D)

94

Page 95:

Defining a Global Probabilistic Model

Joint probability distribution

A B C D

0 0 0 0 P(a0, b0, c0, d0) = 0.04

0 0 0 1 P(a0, b0, c0, d1) = 0.04

0 0 1 0 P(a0, b0, c1, d0) = 0.04

0 0 1 1 P(a0, b0, c1, d1) = 4.1 × 10⁻⁶

… … … … …

P(A, B, C, D)

Table of joint probabilities

95

Page 96:

Defining a Global Probabilistic Model

Joint probability distribution

A B C D

0 0 0 0 P(a0, b0, c0, d0) = 0.04

0 0 0 1 P(a0, b0, c0, d1) = 0.04

0 0 1 0 P(a0, b0, c1, d0) = 0.04

0 0 1 1 P(a0, b0, c1, d1) = 4.1 × 10⁻⁶

… … … … …

P(A, B, C, D)

Table of joint probabilities: n variables → O(2^n) entries!

96

Page 97:

Joint probability distribution

P(A, B, C, D) = F(A,B,C,D)

F(A, B, C, D) = φ1(A,B) · φ2(B,C) · φ3(C,D) · φ4(D,A)

Define the global model by combining

a set of local interactions between variables

97

Defining a Global Probabilistic Model

Page 98:

Factors

• A factor (or potential) φ is a function from a set of random variables D to a real number:

φ : D → R

• The set of variables D is the scope of the factor – we are typically concerned with non-negative factors

• Intuition: captures affinity, agreement, compatibility of the variables in D

98

Page 99:

Factors: Example

A B value

0 0 30

0 1 5

1 0 1

1 1 10

• Example factors
  – φ1(0,0) = 30 says we believe A and B agree on 0 with belief 30
  – φ3(C,D) says that C and D argue all the time

φ1(A,B)

99

Page 100:

Factors: Example

A B value

0 0 30

0 1 5

1 0 1

1 1 10

• Example factors
  – φ1(0,0) = 30 says we believe A and B agree on 0 with belief 30
  – φ3(C,D) says that C and D argue all the time

φ1(A,B)

B C value

0 0 100

0 1 1

1 0 1

1 1 100

φ2(B,C)

C D value

0 0 1

0 1 100

1 0 100

1 1 1

φ3(C,D)

D A value

0 0 100

0 1 1

1 0 1

1 1 100

φ4(D,A)

100

Page 101:

Factors: Example

A B value

0 0 30

0 1 5

1 0 1

1 1 10

φ1(A,B)

B C value

0 0 100

0 1 1

1 0 1

1 1 100

φ2(B,C)

C D value

0 0 1

0 1 100

1 0 100

1 1 1

φ3(C,D)

D A value

0 0 100

0 1 1

1 0 1

1 1 100

φ4(D,A)

Key point: (Local) factors have no probabilistic interpretation

• Example factors
  – φ1(0,0) = 30 says we believe A and B agree on 0 with belief 30
  – φ3(C,D) says that C and D argue all the time

Page 102:

Defining a Global Probabilistic Model

Joint probability distribution

P(A, B, C, D) = F(A,B,C,D)

F(A, B, C, D) = φ1(A,B) · φ2(B,C) · φ3(C,D) · φ4(D,A)

102

Page 103:

Defining a Global Probabilistic Model

P(A, B, C, D) = F(A, B, C, D) / Z

F(A, B, C, D) = φ1(A,B) · φ2(B,C) · φ3(C,D) · φ4(D,A)

Z = Σ_{a ∈ A, b ∈ B, c ∈ C, d ∈ D} F(a, b, c, d)

Partition function

Joint probability distribution

103

Page 104:

P(A,B,C,D)

A B C D Unnormalized (F)

Normalized (P = F/Z)

0 0 0 0 300,000 0.04

0 0 0 1 300,000 0.04

0 0 1 0 300,000 0.04

0 0 1 1 30 4.1 × 10⁻⁶

… … … … … …

Z(A,B,C,D) = 7,201,840

104

A B value

0 0 30

0 1 5

1 0 1

1 1 10

φ1(A,B)

B C value

0 0 100

0 1 1

1 0 1

1 1 100

φ2(B,C)

C D value

0 0 1

0 1 100

1 0 100

1 1 1

φ3(C,D)

D A value

0 0 100

0 1 1

1 0 1

1 1 100

φ4(D,A)

Page 105:

P(A,B,C,D)

A B C D Unnormalized (F)

Normalized (P = F/Z)

0 0 0 0 300,000 0.04

0 0 0 1 300,000 0.04

0 0 1 0 300,000 0.04

0 0 1 1 30 4.1 × 10⁻⁶

… … … … … …

Z(A,B,C,D) = 7,201,840

We can now answer probability queries on P(A, B, C, D):

For example: P(A = 0, B = 0) = Σ_{c ∈ C, d ∈ D} P(0, 0, c, d) ≈ 0.125

105
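The numbers on this slide can be checked directly. A few lines of Python recover Z = 7,201,840 and the marginal P(A=0, B=0) ≈ 0.125 from the four factor tables above:

```python
from itertools import product

# The factor tables from the slides, as dicts over binary assignments.
phi1 = {(0,0): 30, (0,1): 5, (1,0): 1, (1,1): 10}    # φ1(A,B)
phi2 = {(0,0): 100, (0,1): 1, (1,0): 1, (1,1): 100}  # φ2(B,C)
phi3 = {(0,0): 1, (0,1): 100, (1,0): 100, (1,1): 1}  # φ3(C,D)
phi4 = {(0,0): 100, (0,1): 1, (1,0): 1, (1,1): 100}  # φ4(D,A)

def F(a, b, c, d):
    return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]

# Partition function: sum of F over all 2^4 assignments.
Z = sum(F(*x) for x in product([0, 1], repeat=4))

P = lambda a, b, c, d: F(a, b, c, d) / Z

# Marginal query: sum out C and D.
p_a0_b0 = sum(P(0, 0, c, d) for c, d in product([0, 1], repeat=2))
print(Z, round(P(0, 0, 0, 0), 2), round(p_a0_b0, 3))
# → 7201840 0.04 0.125
```

The same brute-force summation also illustrates why the exponential table is a problem: with n binary variables, both Z and any marginal require 2^n terms unless the graph structure is exploited.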

Page 106:

Let X = {X1, …, Xn} be a set of random variables and D1, …, Dm ⊆ X

Then, a Markov Network is defined as:

P(X1, X2, …, Xn) = φ1(D1) · φ2(D2) · … · φm(Dm) / Z

Z = Σ_{x1, …, xn} φ1(D1) · φ2(D2) · … · φm(Dm)

Markov Network

106

Page 107:

Graphs vs. Probability Distributions

We will next relate graphs and probability distributions. This is important for two reasons.

First, it allows us to determine properties (e.g., independence of variables) of a probability distribution directly from the graph.

Second, it allows us to answer queries (e.g., MAP inference) on the probability distribution by working with the graph.

107

Page 108:

Graphs vs. Probability Distributions

P(A, B, C, D) = φ1(A, B) · φ2(B, D) · φ3(D, C) · φ4(C, A) / Z

108

Which variables are independent?

Page 109:

Graphs vs. Probability Distributions

A B

C D

P(A, B, C, D) = φ1(A, B) · φ2(B, D) · φ3(D, C) · φ4(C, A) / Z

φ1(A, B)

φ2(B, D)

φ3(D, C)

φ4(C, A)

Intuitively, variables are nodes and factors introduce edges

109

Page 110:

A distribution P equipped with a set of factors φ1(D1), …, φm(Dm)

factorizes over a graph G if each Di is a complete subgraph of G

Graph Factorization

110

Page 111:

A distribution P equipped with a set of factors φ1(D1), …, φm(Dm)

factorizes over a graph G if each Di is a complete subgraph of G

A B

C D

P(A, B, C, D) = φ1(A, B) · φ2(B, D) · φ3(D, C) · φ4(C, A) / Z

Graph Factorization

111

Page 112:

A distribution P equipped with a set of factors φ1(D1), …, φm(Dm)

factorizes over a graph G if each Di is a complete subgraph of G

A B

C D

P(A, B, C, D) = φ1(A, B) · φ2(B, D) · φ3(D, C) · φ4(C, A) / Z

Here, P factorizes over G

Graph Factorization

112

Page 113:

A distribution P equipped with a set of factors φ1(D1), …, φm(Dm)

factorizes over a graph G if each Di is a complete subgraph of G

A B

C D

P(A, B, C, D) = φ1(A, B) · φ2(B, D) · φ3(D, C) · φ4(C, A) / Z

Here, P factorizes over G

A B

C D

P(A, B, C, D) = φ1(A, B, C) · φ2(B, D, C) / Z

Graph Factorization

113

Page 114:

A distribution P equipped with a set of factors φ1(D1), …, φm(Dm)

factorizes over a graph G if each Di is a complete subgraph of G

A B

C D

P(A, B, C, D) = φ1(A, B) · φ2(B, D) · φ3(D, C) · φ4(C, A) / Z

Here, P factorizes over G

A B

C D

P(A, B, C, D) = φ1(A, B, C) · φ2(B, D, C) / Z

Here, P does not factorize over G

Graph Factorization

114

Page 115:

A distribution P equipped with a set of factors φ1(D1), …, φm(Dm)

factorizes over a graph G if each Di is a complete subgraph of G

A B

C D

P(A, B, C, D) = φ1(A, B) · φ2(A, C) · φ3(A, D) · φ4(B, C) · φ5(B, D) · φ6(C, D) / Z

A B

C D

P(A, B, C, D) = φ1(A, B, C, D) / Z

Graph Factorization

115

Page 116:

Graphs vs. Probabilities

There is an important connection between a probability distribution that factorizes over a graph and the properties of the graph.

In particular, we can discover various independence properties of the probabilistic distribution directly from the graph

116

Page 117:

Two disjoint sets of nodes A and B in a graph G are separated by a set S if every path between A and B goes through a node in S. A and B are also separated if no path between A and B exists, regardless of the set S.

The Global Markov Property says that if A and B are separated by S, then A ⊥ B | S, that is, P(A, B | S) = P(A | S) · P(B | S)

Graph Separation

117

Page 118:

[Figure: example graph over the nodes A, B, C, D, E, F, G]

Graph Separation: Example

118

Page 119:

Example of global independencies found in the above graph:

{A} ⊥ {G} | {D}

{B} ⊥ {F} | {E,G}

[Figure: example graph over the nodes A, B, C, D, E, F, G]

Graph Separation: Example

119

P(A, G | D) = P(A | D) P(G | D)
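Graph separation itself is a plain reachability check. The exact edges of the slide's graph are not recoverable from the text, so the sketch below uses an assumed edge set that is consistent with the two stated independencies:

```python
from collections import deque

def separated(edges, A, B, S):
    """True if every path from A to B passes through a node in S
    (BFS from A that refuses to enter nodes of S)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    frontier = deque(a for a in A if a not in S)
    seen = set(frontier)
    while frontier:
        u = frontier.popleft()
        if u in B:
            return False       # found a path avoiding S
        for v in adj.get(u, ()):
            if v not in S and v not in seen:
                seen.add(v)
                frontier.append(v)
    return True

# Hypothetical edge set over A..G (not the slide's exact graph):
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"),
         ("D", "E"), ("E", "G"), ("G", "F")]
print(separated(edges, {"A"}, {"G"}, {"D"}))       # True: every A-G path crosses D
print(separated(edges, {"B"}, {"F"}, {"E", "G"}))  # True: every B-F path crosses E or G
```

By the Global Markov Property, each `True` answer licenses the corresponding conditional-independence statement about any distribution that factorizes over the graph.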

Page 120:

Graphical Models: So far

So far we learned what a Markov Network is, what it means for a probability distribution to factor over a graph and how to extract information about independence of variables from the graph.

So far we have been discussing the joint probability distribution P(X1,…,Xn). We next focus on conditional distributions.

120

Page 121:

So Far: Generative Models

121

Page 122:

Next: Discriminative Models

Predict unknown facts given some known facts

122

Page 123:

Next: Discriminative Models

Predict unknown facts given some known facts

✘ do not model distribution over known facts

Page 124:

Conditional Random Field

Let X ∪ Y be a set of random variables where X = {X1, …, Xn}, Y = {Y1, …, Yn}. Here, X are the observed variables and Y are the target variables.

Let D1, D2, …, Dm ⊆ X ∪ Y where Di ⊈ X

A Conditional Random Field is defined as:

P(Y1, …, Yn | X1, …, Xn) = φ1(D1) · φ2(D2) · … · φm(Dm) / Z(X1, …, Xn)

Z(X1, …, Xn) = Σ_Y φ1(D1) · φ2(D2) · … · φm(Dm)

Note that the Z function sums over Y and thus ranges over the variables in X only.

[J. Lafferty, A. McCallum, F. Pereira, ICML 2001]

124

Page 125:

Key advantage: avoids encoding distribution over the variables X. Thus,

we can use many observed variables with complex dependencies without

needing to model any joint distributions over them.

P(Y1, …, Yn | X1, …, Xn) = φ1(D1) · φ2(D2) · … · φm(Dm) / Z(X1, …, Xn)

Z(X1, …, Xn) = Σ_Y φ1(D1) · φ2(D2) · … · φm(Dm)

Conditional Random Field [J. Lafferty, A. McCallum, F. Pereira, ICML 2001]

125

Page 126:

Conditional Random Field: Example Text Analytics (Named Entity Recognition)

Problem Statement:

Given a sentence, label each word with whether it is a person or a location

Page 127:

Conditional Random Field: Example Text Analytics (Named Entity Recognition)

Problem Statement:

Given a sentence, label each word with whether it is a person or a location

5 Labels: B-per, I-per, B-loc, I-loc, O

Page 128:

Conditional Random Field: Example Text Analytics (Named Entity Recognition)

Problem Statement:

Given a sentence, label each word with whether it is a person or a location

5 Labels: B-per, I-per, B-loc, I-loc, O

Mr. Smith came today to Toronto Given:

Page 129:

Conditional Random Field: Example Text Analytics (Named Entity Recognition)

Problem Statement:

Given a sentence, label each word with whether it is a person or a location

Mr. Smith came today to Toronto

B-per I-per O O O B-loc

Given:

Predict:

5 Labels: B-per, I-per, B-loc, I-loc, O

Page 130:

Conditional Random Field: Example Text Analytics (Named Entity Recognition)

Problem Statement:

Given a sentence, label each word with whether it is a person or a location

5 Labels: B-per, I-per, B-loc, I-loc, O

Mr. Smith came today to Toronto

X1 X2 X3 X4 X6 X7

Page 131:

Conditional Random Field: Example Text Analytics (Named Entity Recognition)

Problem Statement:

Given a sentence, label each word with whether it is a person or a location

5 Labels: B-per, I-per, B-loc, I-loc, O

Mr. Smith came today to Toronto

Y1 Y2 Y3 Y4 Y6 Y7

X1 X2 X3 X4 X6 X7

Page 132:

Conditional Random Field: Example Text Analytics (Named Entity Recognition)

Problem Statement:

Given a sentence, label each word with whether it is a person or a location

5 Labels: B-per, I-per, B-loc, I-loc, O

Two kinds of factors for each word Xt:

φ¹ₜ(Yₜ , Yₜ₊₁) — captures the relationship between neighboring predictions

φ²ₜ(Yₜ , Xₜ₋₁ , Xₜ) — captures the relationship between the prediction and the neighbors of the word

Mr. Smith came today to Toronto

X1 X2 X3 X4 X6 X7

Y1 Y2 Y3 Y4 Y6 Y7

Page 133:

Conditional Random Field: Example Text Analytics (Named Entity Recognition)

Problem Statement:

Given a sentence, label each word with whether it is a person or a location

5 Labels: B-per, I-per, B-loc, I-loc, O

Two kinds of factors for each word Xt: φ¹ₜ(Yₜ , Yₜ₊₁) and φ²ₜ(Yₜ , Xₜ₋₁ , Xₜ)

Mr. Smith came today to Toronto

Y1 Y2 Y3 Y4 Y6 Y7

X1 X2 X3 X4 X6 X7

Page 134:

Conditional Random Field: Example Text Analytics (Named Entity Recognition)

Problem Statement:

Given a sentence, label each word with whether it is a person or a location

P(Y1,…,Y7 | X1,…, X7) =

φ¹₁(Y1, Y2) ∙ … ∙ φ¹₆(Y6, Y7) ∙ φ²₁(Y1, X1, X2) ∙ … ∙ φ²₇(Y7, X6, X7) / Z(X1,…, X7)

Z(X1,…, X7) = Σ over Y1,Y2,Y3,Y4,Y5,Y6,Y7 of φ¹₁(Y1, Y2) ∙ … ∙ φ¹₆(Y6, Y7) ∙ φ²₁(Y1, X1, X2) ∙ … ∙ φ²₇(Y7, X6, X7)

Mr. Smith came today to Toronto

Y1 Y2 Y3 Y4 Y6 Y7

X1 X2 X3 X4 X6 X7

Page 135:

Conditional Random Field: Example Text Analytics (Named Entity Recognition)

Problem Statement:

Given a sentence, label each word with whether it is a person or a location

P(Y1,…,Y7 | X1,…, X7) =

φ¹₁(Y1, Y2) ∙ … ∙ φ¹₆(Y6, Y7) ∙ φ²₁(Y1, X1, X2) ∙ … ∙ φ²₇(Y7, X6, X7) / Z(X1,…, X7)

Z(X1,…, X7) = Σ over Y1,Y2,Y3,Y4,Y5,Y6,Y7 of φ¹₁(Y1, Y2) ∙ … ∙ φ¹₆(Y6, Y7) ∙ φ²₁(Y1, X1, X2) ∙ … ∙ φ²₇(Y7, X6, X7)

Mr. Smith came today to Toronto

Y1 Y2 Y3 Y4 Y6 Y7

X1 X2 X3 X4 X6 X7

B-per I-per O O O B-loc
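The chain model above can be run by brute force on a toy example. The sketch below uses hypothetical factor values (the words and label set follow the slides; the scores do not) and enumerates all label sequences to find the MAP assignment:

```python
import itertools

words = ["Mr.", "Smith", "Toronto"]
labels = ["B-per", "I-per", "B-loc", "O"]

# phi2: relationship between a prediction and the word (emission-style factor)
phi2 = {("B-per", "Mr."): 5.0, ("I-per", "Smith"): 5.0, ("B-loc", "Toronto"): 5.0}
# phi1: relationship between neighboring predictions (transition-style factor)
phi1 = {("B-per", "I-per"): 3.0}

def score(ys):
    # Unnormalized probability: product of all factor values (default 1.0)
    s = 1.0
    for y, w in zip(ys, words):
        s *= phi2.get((y, w), 1.0)
    for y, y_next in zip(ys, ys[1:]):
        s *= phi1.get((y, y_next), 1.0)
    return s

# MAP inference: jointly maximize over all label sequences
y_map = max(itertools.product(labels, repeat=len(words)), key=score)
```

Enumeration is exponential in the sentence length; real taggers use Viterbi-style dynamic programming over the chain instead.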

Page 136:

Learning from Data

We next look at equivalent representation of Markov Networks and Conditional Random Fields that are amenable to learning from data (e.g., log-linear form).

Page 137:

Another representation of a Markov Network is that of a log-linear model

Here, we have a set of feature functions 𝑓1(D1),…, 𝑓𝑘(D𝑘) where each D𝑖 is a

complete subgraph of G. An 𝑓𝑖(D𝑖) can return any value, i.e., can be negative

P(X1,…,Xn) = exp(−𝑤1 𝑓1(D1)) ∙ … ∙ exp(−𝑤𝑘 𝑓𝑘(D𝑘)) / Z(X1,…,Xn)

Z(X1, …, Xn) = Σ over X1,…,Xn of exp(−Σᵢ₌₁ᵏ 𝑤𝑖 𝑓𝑖(D𝑖))

Any Markov Network whose factors are positive can be converted to a log-linear model
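The conversion can be sketched as follows. The 2×2 factor table is illustrative; using one indicator feature per assignment with weight w = −ln φ reproduces the factor exactly under the slides' exp(−w·f) convention:

```python
import math
import itertools

# A positive factor over two binary variables (hypothetical values).
phi = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}

# One indicator feature per assignment, with weight w = -ln(phi).
weights = {assignment: -math.log(value) for assignment, value in phi.items()}

def loglinear(a, b):
    # Only the indicator feature for (a, b) fires; exp(-w * 1) = phi(a, b).
    exponent = sum(-w * (1 if (a, b) == assignment else 0)
                   for assignment, w in weights.items())
    return math.exp(exponent)

ok = all(abs(loglinear(a, b) - phi[(a, b)]) < 1e-9
         for a, b in itertools.product([0, 1], repeat=2))
```

This per-assignment encoding is the most verbose choice; the next slides show how a single shared indicator feature can compress it.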

Log-Linear Models

Page 138:

Log-linear models have several benefits over plain factors. They:

1. Make certain relationships more explicit

2. Offer a much more compact representation for many distributions,

especially for variables with large domains (e.g., names in JSNice)

3. Are useful for learning: given the feature functions, the learning phase simply figures out the weights.

Z(X1, …, Xn) = Σ over X1,…,Xn of exp(−Σᵢ₌₁ᵏ 𝑤𝑖 𝑓𝑖(D𝑖))

Log-Linear Representation: Benefits

P(X1,…,Xn) = exp(−𝑤1 𝑓1(D1)) ∙ … ∙ exp(−𝑤𝑘 𝑓𝑘(D𝑘)) / Z(X1,…,Xn)

Page 139:

Feature Function: Example

For example, suppose that for the factor φ1:

φ1(A, B) = 100 if A = B, and 1 otherwise

If A and B range over m and n values respectively, we have m ∙ n possible values to store in order to keep this factor.


A B φ1(A, B)

0 0 100

0 1 1

1 0 1

1 1 100

Page 140:

Feature Function: Example

For example, suppose that for the factor φ1:

φ1(A, B) = 100 if A = B, and 1 otherwise

If A and B range over m and n values respectively, we have m ∙ n possible values to store in order to keep this factor.

Instead, we can use a single (indicator) feature function 𝑓1(A, B), where 𝑓1(A, B) = 1 if A = B and 0 otherwise, with weight 𝑤1 = −4.61 (note that exp(4.61) ≈ 100).


A B φ1(A, B)

0 0 100

0 1 1

1 0 1

1 1 100

Indicator functions can be directly extracted from data during learning (later)
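A minimal sketch of this indicator feature, using the slide's weight w1 = −4.61 and the exp(−w·f) convention; the single weight replaces the four stored table entries up to a small rounding error:

```python
import math

# Single indicator feature replacing the 2x2 factor table.
w1 = -4.61

def f1(a, b):
    # Indicator: fires only when the two variables agree.
    return 1 if a == b else 0

def factor(a, b):
    # exp(-w1 * f1) gives ~100 when A == B and exactly 1 otherwise.
    return math.exp(-w1 * f1(a, b))
```

For variables with large domains (such as names in JSNice), this compression is the difference between storing a handful of weights and storing a table with millions of entries.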

Page 141:

Graphical Models: So far

In the last segment we learned about log-linear models and indicator functions.

Log-linear models are important because they capture factors more concisely, their indicator functions can be extracted directly from data, and their weights can be learned to fit the optimization objective (discussed later).

Page 142:

Hands-on

Page 143:


function chunkData(e, t) { var n = []; var r = e.length; var i = 0; for (; i < r; i += t) { if (i + t < r) n.push(e.substring(i, i + t)); else n.push(e.substring(i, r)); } return n; }

Structured Prediction for Programs

Page 144:

function chunkData(e, t) { var n = []; var r = e.length; var i = 0; for (; i < r; i += t) { if (i + t < r) n.push(e.substring(i, i + t)); else n.push(e.substring(i, r)); } return n; }

t r i

length

Unknown

facts:

Known

facts:

i t

r length


Structured Prediction for Programs

Page 145:

t r i

length

Unknown

facts:

Known

facts:

i t

r length

i t w

i step 0.5

j step 0.4

i r w

i len 0.6

i length 0.3

r length w

length length 0.5

len length 0.3

i

t

r length

function chunkData(e, t) { var n = []; var r = e.length; var i = 0; for (; i < r; i += t) { if (i + t < r) n.push(e.substring(i, i + t)); else n.push(e.substring(i, r)); } return n; }

Structured Prediction for Programs

Page 146:

t r i

length

Unknown

facts:

Known

facts:

i t

r length

i t w

i step 0.5

j step 0.4

i r w

i len 0.6

i length 0.3

r length w

length length 0.5

len length 0.3

i

t

r length

argmax 𝒘𝑇𝒇(𝑖, 𝑡, 𝑟, 𝑙𝑒𝑛𝑔𝑡ℎ)

function chunkData(e, t) { var n = []; var r = e.length; var i = 0; for (; i < r; i += t) { if (i + t < r) n.push(e.substring(i, i + t)); else n.push(e.substring(i, r)); } return n; }

Structured Prediction for Programs

Page 147:

t r i

length

Unknown

facts:

Known

facts:

i t

r length

i t w

i step 0.5

j step 0.4

i r w

i len 0.6

i length 0.3

r length w

length length 0.5

len length 0.3

i

t

r length i

len

step

argmax 𝒘𝑇𝒇(𝑖, 𝑡, 𝑟, 𝑙𝑒𝑛𝑔𝑡ℎ)

function chunkData(e, t) { var n = []; var r = e.length; var i = 0; for (; i < r; i += t) { if (i + t < r) n.push(e.substring(i, i + t)); else n.push(e.substring(i, r)); } return n; }

Structured Prediction for Programs

Page 148:

t r i

length

Unknown

facts:

Known

facts:

i t

r length

i t w

i step 0.5

j step 0.4

i r w

i len 0.6

i length 0.3

r length w

length length 0.5

len length 0.3

function chunkData(str, step) { var colNames = []; var len = str.length; var i = 0; for (; i < len; i += step) { if (i + step < len) colNames.push(str.substring(i, i + step)); else colNames.push(str.substring(i, len)); } return colNames; }

i

t

r length i

len

step

argmax 𝒘𝑇𝒇(𝑖, 𝑡, 𝑟, 𝑙𝑒𝑛𝑔𝑡ℎ)

function chunkData(e, t) { var n = []; var r = e.length; var i = 0; for (; i < r; i += t) { if (i + t < r) n.push(e.substring(i, i + t)); else n.push(e.substring(i, r)); } return n; }

Structured Prediction for Programs
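The argmax above can be computed by brute force on this small graph. The sketch below uses the candidate names and pairwise scores from the slide's tables (missing pairs are assumed to score 0); the best assignment renames t to step and r to len, matching the final code:

```python
import itertools

# Unknown names and their candidate values (known fact: "length").
candidates = {"i": ["i", "j"], "t": ["step"], "r": ["len", "length"]}

# Pairwise scores per graph edge, taken from the slide's tables.
scores = {
    ("i", "t"): {("i", "step"): 0.5, ("j", "step"): 0.4},
    ("i", "r"): {("i", "len"): 0.6, ("i", "length"): 0.3},
    ("r", "length"): {("length", "length"): 0.5, ("len", "length"): 0.3},
}

def total(assign):
    # w^T f: sum the score of each edge under the assignment (0 if unseen).
    s = scores[("i", "t")].get((assign["i"], assign["t"]), 0.0)
    s += scores[("i", "r")].get((assign["i"], assign["r"]), 0.0)
    s += scores[("r", "length")].get((assign["r"], "length"), 0.0)
    return s

best = max((dict(zip(candidates, names))
            for names in itertools.product(*candidates.values())),
           key=total)
```

Note that the jointly best assignment picks r = len (total 1.4) even though the single edge (r, length) prefers length; this is exactly why joint MAP inference beats per-variable choices.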

Page 149:

• Motivation • Potential applications

• Statistical Language Models

• N-gram, Recurrent Networks, PCFGs • Application: code completion

• Graphical Models

• Markov Networks, Conditional Random Fields • Inference & Learning in Markov Networks • Application: predicting names and types

• Learning Features for Programs • Combining Program Synthesis + Machine Learning

Tutorial Outline

Page 150:

Queries: MAP vs. Max Marginals

MAP Inference: (𝑦1, … , 𝑦𝑛)MAP = argmax over (𝑦1, … , 𝑦𝑛) of 𝑃(𝑦1, … , 𝑦𝑛)

vs.

Max Marginals: (𝑦1, … , 𝑦𝑛)ML = (𝑦1ML, … , 𝑦𝑛ML), where 𝑦𝑖ML = argmax over 𝑦𝑖 of 𝑃(𝑦𝑖), for i = 1, …, n

Page 151:

Queries: MAP vs. Max Marginals

MAP Inference: (𝑦1, … , 𝑦𝑛)MAP = argmax over (𝑦1, … , 𝑦𝑛) of 𝑃(𝑦1, … , 𝑦𝑛)

vs.

Max Marginals: (𝑦1, … , 𝑦𝑛)ML = (𝑦1ML, … , 𝑦𝑛ML), where 𝑦𝑖ML = argmax over 𝑦𝑖 of 𝑃(𝑦𝑖), for i = 1, …, n

A B val

0 0 0.25

0 1 0.3

1 0 0.10

1 1 0.35

Consider the following probability distribution:

Page 152:

Queries: MAP vs. Max Marginals

MAP Inference: (𝑦1, … , 𝑦𝑛)MAP = argmax over (𝑦1, … , 𝑦𝑛) of 𝑃(𝑦1, … , 𝑦𝑛)

vs.

Max Marginals: (𝑦1, … , 𝑦𝑛)ML = (𝑦1ML, … , 𝑦𝑛ML), where 𝑦𝑖ML = argmax over 𝑦𝑖 of 𝑃(𝑦𝑖), for i = 1, …, n

A B val

0 0 0.25

0 1 0.3

1 0 0.10

1 1 0.35

Consider the following probability distribution:

argmax over (A, B) of 𝑃(A, B)   vs.   argmax over A of 𝑃(A) and argmax over B of 𝑃(B)

Page 153:

Queries: MAP vs. Max Marginals

MAP Inference: (𝑦1, … , 𝑦𝑛)MAP = argmax over (𝑦1, … , 𝑦𝑛) of 𝑃(𝑦1, … , 𝑦𝑛)

vs.

Max Marginals: (𝑦1, … , 𝑦𝑛)ML = (𝑦1ML, … , 𝑦𝑛ML), where 𝑦𝑖ML = argmax over 𝑦𝑖 of 𝑃(𝑦𝑖), for i = 1, …, n

A B val

0 0 0.25

0 1 0.3

1 0 0.10

1 1 0.35

Consider the following probability distribution:

MAP: argmax over (A, B) of 𝑃(A, B) = (1, 1), since 𝑃(1, 1) = 0.35 is the largest joint entry.

Max Marginals: argmax 𝑃(A) = 0 (since 𝑃(A=0) = 0.55) and argmax 𝑃(B) = 1 (since 𝑃(B=1) = 0.65), giving (0, 1).
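Both queries can be computed directly from the joint table on the slide:

```python
# Joint distribution over (A, B) from the slide.
P = {(0, 0): 0.25, (0, 1): 0.30, (1, 0): 0.10, (1, 1): 0.35}

# MAP: the single most likely joint assignment.
map_ab = max(P, key=P.get)

# Max marginals: maximize each variable's marginal independently.
pA = {a: sum(p for (x, y), p in P.items() if x == a) for a in (0, 1)}
pB = {b: sum(p for (x, y), p in P.items() if y == b) for b in (0, 1)}
ml_ab = (max(pA, key=pA.get), max(pB, key=pB.get))
```

The two answers disagree, and (0, 1) is not even the second most likely joint assignment, which is why the choice of query matters for prediction accuracy.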

Page 154:

Queries: MAP vs. Max Marginals

MAP Inference: (𝑦1, … , 𝑦𝑛)MAP = argmax 𝑃(𝑦1, … , 𝑦𝑛) 𝑦1, … , 𝑦𝑛

Max Marginals: (𝑦1, … , 𝑦𝑛)ML = (𝑦1ML, … , 𝑦𝑛ML)

𝑦1ML = argmax 𝑃(𝑦1)

𝑦1

𝑦nML = argmax 𝑃(𝑦𝑛)

𝑦𝑛 ……

VS.

Baseline 25.3%

Max Marginals 54.1%

MAP Inference 63.4%

Page 155:

MAP Inference

Bad news: computing the argmax is in general NP-hard (Max-SAT)

Good news: many approximate methods exist • Gibbs sampling

• Expectation Maximization

• Variational methods

• Elimination Algorithm

• Junction-Tree algorithm

• Loopy Belief Propagation

Page 156:

MAP Inference

Bad news: computing the argmax is in general NP-hard (Max-SAT)

Good news: many approximate methods exist • Gibbs sampling

• Expectation Maximization

• Variational methods

• Elimination Algorithm

• Junction-Tree algorithm

• Loopy Belief Propagation


Bad news: still too slow for learning

𝑂(𝑁 ∙ 𝑀^𝑁)

𝑁: number of neighbors 𝑀: number of states

Page 157:

MAP Inference

Bad news: computing the argmax is in general NP-hard (Max-SAT)

Good news: many approximate methods exist • Gibbs sampling

• Expectation Maximization

• Variational methods

• Elimination Algorithm

• Junction-Tree algorithm

• Loopy Belief Propagation


Bad news: still too slow for learning

𝑂(𝑁 ∙ 𝑀^𝑁)

𝑁: number of neighbors 𝑀: number of states

We designed an approximate method to fit our needs, i.e., one that deals with labels over many possible values

Page 158:

Structured SVM Training [N. Ratliff, J. Bagnell, M. Zinkevich, AISTATS 2007]

Given a data set D = { (x(j) , y(j)) }j=1..n , learn the weight vector 𝒘

𝑃(𝑦 | 𝑥) = (1 / 𝑍(𝑥)) ∙ exp(𝒘𝑇𝒇(𝑦, 𝑥))

Page 159:

Structured SVM Training

Optimization objective (max-margin training):

∀j ∀y Σ 𝑤ifi(x(j),y(j)) ≥ Σ 𝑤ifi(x(j),y) + 𝚫(y,y(j))

[N. Ratliff, J. Bagnell, M. Zinkevich, AISTATS 2007]

Given a data set D = { (x(j) , y(j)) }j=1..n , learn the weight vector 𝒘

𝑃(𝑦 | 𝑥) = (1 / 𝑍(𝑥)) ∙ exp(𝒘𝑇𝒇(𝑦, 𝑥))

Page 160:

Structured SVM Training

Optimization objective (max-margin training):

∀j ∀y Σ 𝑤ifi(x(j),y(j)) ≥ Σ 𝑤ifi(x(j),y) + 𝚫(y,y(j))

[N. Ratliff, J. Bagnell, M. Zinkevich, AISTATS 2007]

Given a data set D = { (x(j) , y(j)) }j=1..n , learn the weight vector 𝒘

for all samples

Given prediction is better than any other prediction by a margin

𝑃(𝑦 | 𝑥) = (1 / 𝑍(𝑥)) ∙ exp(𝒘𝑇𝒇(𝑦, 𝑥))

Page 161:

Structured SVM Training

Optimization objective (max-margin training):

∀j ∀y Σ 𝑤ifi(x(j),y(j)) ≥ Σ 𝑤ifi(x(j),y) + 𝚫(y,y(j))

[N. Ratliff, J. Bagnell, M. Zinkevich, AISTATS 2007]

Given a data set D = { (x(j) , y(j)) }j=1..n , learn the weight vector 𝒘

for all samples

Given prediction is better than any other prediction by a margin

Avoids expensive computation of the partition function 𝑍(𝑥)

𝑃(𝑦 | 𝑥) = (1 / 𝑍(𝑥)) ∙ exp(𝒘𝑇𝒇(𝑦, 𝑥))
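A toy sketch of max-margin training. This uses simple structured-perceptron updates with loss-augmented inference rather than the subgradient method of the cited paper, and the three-class data set below is made up; it illustrates why no partition function Z(x) is ever computed:

```python
# Toy "structured" problem: y is a class in {0, 1, 2}, and the joint
# feature f(x, y) places x in the block of weights belonging to y.
data = [((3.0, 0.0), 0), ((0.0, 3.0), 1), ((2.0, 2.0), 2)]
n_classes, dim = 3, 2
w = [0.0] * (n_classes * dim)

def f(x, y):
    v = [0.0] * (n_classes * dim)
    v[y * dim], v[y * dim + 1] = x[0], x[1]
    return v

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def delta(y, y_true):
    # Margin: a wrong prediction must be beaten by at least 1.
    return 0.0 if y == y_true else 1.0

for _ in range(100):
    for x, y_true in data:
        # Loss-augmented inference: find the strongest margin violator.
        y_hat = max(range(n_classes),
                    key=lambda y: dot(w, f(x, y)) + delta(y, y_true))
        if y_hat != y_true:
            for i, (a, b) in enumerate(zip(f(x, y_true), f(x, y_hat))):
                w[i] += a - b

preds = [max(range(n_classes), key=lambda y, x=x: dot(w, f(x, y)))
         for x, _ in data]
```

Each update only needs an argmax over candidate outputs, which is exactly the MAP inference discussed earlier; normalizing over all outputs is never required.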

Page 162:

Learning Phase

SSVM

learning

program

analysis

transform

Prediction Phase

program

analysis

var n = []; var r = e.length; var i = 0; for (; i < r; i += t) { if (i + t < r) n.push(e.subs(i, i + t)); else n.push(e.subs(i, r)); } return n;

var colNames = []; var len = str.length; var i = 0; for (; i < len; i += step) { if (i + step < len) colNames.push(str.subs(i, i + step)); else colNames.push(str.subs(i, len)); } return colNames;

MAP

inference

alias,call

analysis 7M feature functions for names

70K feature functions for types

max-margin

training

150MB

30 nodes, 400 edges

Time: milliseconds

Names: 63% Types: 81% (helps typechecking)

Conditional Random Field

𝑃 𝑦 𝑥

Structured Prediction for Programs (V. Raychev, M. Vechev, A. Krause, ACM POPL’15)

Page 163:

• Motivation • Potential applications

• Statistical Language Models

• N-gram, Recurrent Networks, PCFGs • Application: code completion

• Graphical Models

• Markov Networks, Conditional Random Fields • Inference & Learning in Markov Networks • Application: predicting names and types

• Learning Features for Programs • Combining Program Synthesis + Machine Learning

Tutorial Outline

Page 164:

Models of Full Language

• So far:

– N-gram and RNNs

– Modelling specific elements (APIs)

• How can we build statistical language models of a full programming language?

Page 165:

Program Representation elem.notify({

position: ‘top’,

autoHide: false,

delay: 100

});


Sequences

elem, ., notify, (, {,

position, :, ‘top’, ‘,’,

autoHide, :, false, ‘,’,

delay, :, 100,

}, ), ;

Page 166:

Program Representation elem.notify({

position: ‘top’,

autoHide: false,

delay: 100

});

Sequences

elem, ., notify, (, {,

position, :, ‘top’, ‘,’,

autoHide, :, false, ‘,’,

delay, :, 100,

}, ), ;


Learned Model

?, ?, ?, (, {,

?, :, ?, ‘,’,

?, :, ?, ‘,’,

?, :, ?,

}, ?, ;

Page 167:

Statistical Code Synthesis: Existing Approaches

x arg max P( x | )

promise

defer

reject

Semantic

Hard-coded heuristics

Task & Language specific

x arg max P( x | )

label

(features)

conditioning context

Syntactic

Bad fit for

programs

[Nguyen et al., 2013]

[Allamanis et al., 2014]

[Raychev et al., 2014]

[Hindle et al., 2012]

[Allamanis et al., 2015]

Page 168:

Program Representation elem.notify({

position: ‘top’,

autoHide: false,

delay: 100

});

Identifier

elem

Property

notify

Property

position

Property

autoHide

MemberExpression ObjectExpression

CallExpression

Property

delay

String

‘top’

Boolean

false

Trees

Page 169:

Probabilistic Context Free Grammars

• 𝑁 is a finite set of non-terminal symbols

• Σ is a finite set of terminal symbols

• 𝑅 is a finite set of production rules 𝛼 → 𝛽1…𝛽𝑛

• 𝑆 ∈ 𝑁 is a start symbol

• 𝑞: 𝑅 → [0,1] is the conditional probability of choosing a given production rule

where 𝛼 ∈ 𝑁 and 𝛽𝑖 ∈ (𝑁 ∪ Σ)

Page 170:

Probabilistic Context Free Grammars

Identifier

elem

Property

notify

Property

position

Property

autoHide

MemberExpression ObjectExpression

CallExpression

Property

delay

String

‘top’

Boolean

false

CallExpr → MemberExpr [ elem.notify() ]

CallExpr → MemberExpr Identifier [ elem.notify(x) ]

CallExpr → Identifier [ notify() ]

CallExpr → MemberExpr ObjectExpr [ elem.notify({…}) ]

Page 171:

Probabilistic Context Free Grammars

CallExpr → MemberExpr [ elem.notify() ]

CallExpr → MemberExpr Identifier [ elem.notify(x) ]

CallExpr → MemberExpr ObjectExpr [ elem.notify({…}) ]

𝑞(𝛼 → 𝛽1…𝛽𝑛) ≈ #(𝛼 → 𝛽1…𝛽𝑛) / #(𝛼)   (estimated by counting during training)

∀𝛼: Σ over 𝛼→𝛽1…𝛽𝑛 ∈ 𝑅 of 𝑞(𝛼 → 𝛽1…𝛽𝑛) = 1   (a valid probability distribution)
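The counting estimate can be sketched directly; the toy "treebank" of rule occurrences below is made up, loosely following the CallExpr productions above:

```python
from collections import Counter

# Toy list of observed rule applications: (lhs, rhs) pairs.
rules = [
    ("CallExpr", ("MemberExpr",)),
    ("CallExpr", ("MemberExpr", "ObjectExpr")),
    ("CallExpr", ("MemberExpr", "ObjectExpr")),
    ("CallExpr", ("Identifier",)),
]

rule_counts = Counter(rules)
lhs_counts = Counter(lhs for lhs, _ in rules)

def q(lhs, rhs):
    # q(alpha -> beta1...betan) = #(alpha -> beta1...betan) / #(alpha)
    return rule_counts[(lhs, rhs)] / lhs_counts[lhs]

# The probabilities for a fixed lhs form a valid distribution.
total = sum(q(lhs, rhs) for (lhs, rhs) in rule_counts)
```

In practice the raw counts must be smoothed, since unseen rules would otherwise get probability zero, which is one of the sparsity issues listed in the Lessons slide.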

Page 172:

Probabilistic Context Free Grammars

poor context

for prediction

non-trivial to obtain

good production rules

[Allamanis & Sutton, FSE’14] [Gvero & Kuncak, OOPSLA’15]

[Maddison & Tarlow, ICML’14]

[Liang et.al., ICML’10]

CallExpr → MemberExpr [ elem.notify() ]

CallExpr → MemberExpr Identifier [ elem.notify(x) ]

CallExpr → MemberExpr ObjectExpr [ elem.notify({…}) ]

Page 173:

• Probabilistic Grammars

+ Easy to implement and cheap to train
+ Defined over a generic AST representation of programs
− Important to deal with sparsity via smoothing
− Non-trivial to define good production rules
− Production rules are selected based on limited context (e.g., the parent non-terminal symbol)

Lessons

Page 174:

• Probabilistic Grammars

+ Easy to implement and cheap to train
+ Defined over a generic AST representation of programs
− Important to deal with sparsity via smoothing
− Non-trivial to define good production rules
− Production rules are selected based on limited context (e.g., the parent non-terminal symbol)

Lessons


PCFGs are still poor at modelling a full programming language, but they are a step towards the solution.

Page 175:

Our Goal

Probabilistic

Model

High

Precision

Efficient

Learning

Widely

Applicable

Existing Programs Learning Model

Explainable

Predictions

Page 176:

Key Idea


P( wi | context )

N-grams

RNNs

CRFs

PCFGs

So far: Context defined manually for each model

Page 177:

Key Idea


P( wi | context )

N-grams

RNNs

CRFs

PCFGs

So far: Context defined manually for each model

Goal: Use program synthesis and machine learning techniques to learn the best context for a given query wi

Page 178:

PHOG: Concepts

Generalizes PCFGs to allow conditioning on a richer context.

Program synthesis learns a function that explains the data: given a query, the function returns its conditioning context.

The learned function is then used to build a probabilistic model.

Page 179:

Generalizing PCFG

Context Free Grammar

𝜶 → 𝜷1… 𝜷n

P

Property → x 0.05

Property → y 0.03

Property → promise 0.001

Page 180:

PHOG: Generalizes PCFG

Higher Order Grammar

𝜶[𝜸] → 𝜷1… 𝜷n

P

Property[reject, promise] → promise 0.67

Property[reject, promise] → notify 0.12

Property[reject, promise] → resolve 0.11

Context Free Grammar

𝜶 → 𝜷1… 𝜷n

P

Property → x 0.05

Property → y 0.03

Property → promise 0.001
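The higher-order grammar can be estimated the same way as a PCFG, but with counts keyed on the pair (context γ, production). The observation counts below are hypothetical, chosen so the conditional table resembles the one on the slide:

```python
from collections import Counter, defaultdict

# Hypothetical training observations: (context gamma, chosen production).
observations = [
    (("reject", "promise"), "promise")] * 6 + [
    (("reject", "promise"), "notify"),
    (("reject", "promise"), "resolve"),
    ((), "x"), ((), "y"),
]

# Counts keyed by context, as in alpha[gamma] -> beta.
counts = defaultdict(Counter)
for gamma, production in observations:
    counts[gamma][production] += 1

def q(gamma, production):
    # Conditional probability of the production given the context gamma.
    total = sum(counts[gamma].values())
    return counts[gamma][production] / total

best = counts[("reject", "promise")].most_common(1)[0][0]
```

With an empty context the grammar degenerates to a plain PCFG; the richer γ = (reject, promise) makes `promise` the clear favorite, as on the slide.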

Page 181:

𝜶[𝜸] → 𝜷1… 𝜷n

Conditioning on Richer Context

What is the best conditioning context?

Page 182:

𝜶[𝜸] → 𝜷1… 𝜷n

Conditioning on Richer Context

What is the best conditioning context?

- Control Structures

- …

- Identifiers

- Constants

- APIs

- Fields

Page 183:

𝜶[𝜸] → 𝜷1… 𝜷n

Conditioning on Richer Context

What is the best conditioning context?

- Control Structures

- …

- Identifiers

- Constants

- APIs

- Fields

? 𝜸

Source

Code

Conditioning

Context 185

Page 184:

Production Rules R:

𝜶[𝜸] → 𝜷1… 𝜷n

Function:

f: → 𝜸

Higher Order Grammar

Parametrize the grammar by a function

used to dynamically obtain the context

Page 185:

Production Rules R:

𝜶[𝜸] → 𝜷1… 𝜷n

Function:

f: AST → 𝜸

Higher Order Grammar

Parametrize the grammar by a function

used to dynamically obtain the context

Page 186:

Higher Order Grammar

𝜸 = 𝑓( AST )

Production Rules R:

𝜶[𝜸] → 𝜷1… 𝜷n

Function:

f: AST → 𝜸

Source

Code

Conditioning

Context

Abstract

Syntax Tree

Function

Application

Page 187:

Function Representation

In general:

Unrestricted programs (Turing complete)

Our Work:

TCond Language for navigating over trees

and accumulating context

TCond ::= 𝜀 | WriteOp TCond | MoveOp TCond

MoveOp ::= Up, Left, Right, DownFirst, DownLast,

NextDFS, PrevDFS, NextLeaf, PrevLeaf,

PrevNodeType, PrevNodeValue, PrevNodeContext

WriteOp ::= WriteValue, WriteType, WritePos

Page 188:

Expressing functions: TCond Language

Up

Left

TCond ::= 𝜀 | WriteOp TCond | MoveOp TCond

MoveOp ::= Up, Left, Right, DownFirst, DownLast,

NextDFS, PrevDFS, NextLeaf, PrevLeaf,

PrevNodeType, PrevNodeValue, PrevNodeContext

WriteOp ::= WriteValue, WriteType, WritePos

WriteValue

𝜸 ← 𝜸 ∙ value   (a WriteOp appends to the accumulated context 𝜸)

Page 189:

Example

elem.notify(

... ,

... ,

{

position: ‘top’,

hide: false,

?

}

);

𝜸

Query TCond

191

Page 190:

Example

    elem.notify(..., ..., { position: 'top', hide: false, ? });

TCond        𝜸
Left         {}
WriteValue   {hide}

Page 191:

Example

    elem.notify(..., ..., { position: 'top', hide: false, ? });

TCond        𝜸
Left         {}
WriteValue   {hide}
Up           {hide}
WritePos     {hide, 3}

Page 192:

Example

    elem.notify(..., ..., { position: 'top', hide: false, ? });

TCond        𝜸
Left         {}
WriteValue   {hide}
Up           {hide}
WritePos     {hide, 3}
Up           {hide, 3}
DownFirst    {hide, 3}
DownLast     {hide, 3}
WriteValue   {hide, 3, notify}

Page 193:

Example

    elem.notify(..., ..., { position: 'top', hide: false, ? });

TCond        𝜸
Left         {}
WriteValue   {hide}
Up           {hide}
WritePos     {hide, 3}
Up           {hide, 3}
DownFirst    {hide, 3}
DownLast     {hide, 3}
WriteValue   {hide, 3, notify}

The accumulated context is { Previous Property, Parameter Position, API name }: hide, 3, notify.
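The walkthrough above can be replayed with a small TCond interpreter. This Python sketch uses a simplified AST encoding and op semantics (illustrative assumptions, not the authors' implementation):

```python
# A small interpreter for the TCond navigation/accumulation ops used above,
# run on a simplified AST of elem.notify(..., ..., {position:'top', hide:false, ?}).
# The AST encoding and exact op semantics are illustrative assumptions.

class Node:
    def __init__(self, value, children=()):
        self.value, self.children, self.parent = value, list(children), None
        for c in self.children:
            c.parent = self

    def index(self):
        """Position of this node among its parent's children."""
        return self.parent.children.index(self)

def run_tcond(program, node):
    gamma = []                      # accumulated conditioning context
    for op in program:
        if op == "Left":            # move to the previous sibling
            node = node.parent.children[node.index() - 1]
        elif op == "Up":            # move to the parent
            node = node.parent
        elif op == "DownFirst":     # move to the first child
            node = node.children[0]
        elif op == "DownLast":      # move to the last child
            node = node.children[-1]
        elif op == "WriteValue":    # gamma <- gamma . value
            gamma.append(node.value)
        elif op == "WritePos":      # gamma <- gamma . position
            gamma.append(node.index())
    return gamma

# AST: a call whose callee is the member expression elem.notify and whose
# third argument is the object literal containing the unknown property.
hole = Node("?")
obj = Node("ObjectLiteral", [Node("position"), Node("hide"), hole])
call = Node("Call", [Node("MemberExpr", [Node("elem"), Node("notify")]),
                     Node("arg1"), Node("arg2"), obj])

program = ["Left", "WriteValue", "Up", "WritePos",
           "Up", "DownFirst", "DownLast", "WriteValue"]
print(run_tcond(program, hole))     # ['hide', 3, 'notify']
```

Starting at the hole, the program reproduces the trace from the slides: previous property, parameter position, API name.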

Page 194:

Learning PHOG

TCond   ::= 𝜀 | WriteOp TCond | MoveOp TCond
MoveOp  ::= Up | Left | Right | ...
WriteOp ::= WriteValue | WriteType | ...

Given an existing dataset D, search the TCond language for:

    f_best = argmin_{f ∈ TCond} cost(D, f)
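One way to read the objective is as search by enumeration: generate candidate programs from the DSL, compute cost(D, f) for each, and keep the argmin. A hedged Python sketch, using a toy list-transformation DSL as a stand-in for TCond (all ops, data, and the cost function are illustrative assumptions):

```python
# A toy instance of f_best = argmin_{f in DSL} cost(D, f): enumerate all short
# programs over a tiny op set and keep the cheapest one on dataset D.
from itertools import product

OPS = {
    "Tail":     lambda xs: xs[1:],
    "DropLast": lambda xs: xs[:-1],
    "Reverse":  lambda xs: list(reversed(xs)),
}

def predict(program, xs):
    for op in program:          # apply the ops left to right
        xs = OPS[op](xs)
    return xs[0] if xs else None

def cost(D, program):           # number of wrong predictions on the dataset
    return sum(predict(program, x) != y for x, y in D)

# Dataset: the target is always the second-to-last element of the input.
D = [([1, 2, 3, 4], 3), ([7, 8, 9], 8), ([5, 6], 5)]

def synthesize(D, max_len=2):
    candidates = [p for n in range(1, max_len + 1)
                  for p in product(OPS, repeat=n)]
    return min(candidates, key=lambda p: cost(D, p))

f_best = synthesize(D)
print(f_best, cost(D, f_best))  # ('DropLast', 'Reverse') 0
```

The next slide makes the point that at real scale this enumeration is intractable, because each cost evaluation touches the whole dataset.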

Page 195:

Synthesis: Practically Intractable

• Millions (≈ 10⁸) of input/output examples
• Computing cost(D, f) takes O(|D|) time per candidate f

    f_best = argmin_{f ∈ TCond} cost(D, f)

Key idea: iterative synthesis on a fraction of the examples.

Page 196:

Solution: Two Components

• Program Generator
• Dataset Sampler

The representative dataset sampler picks a dataset d ⊆ D, and the generator solves:

    f_best = argmin_{f ∈ TCond} cost(d, f)

Page 197:

In a Loop

• Start with a small number of samples d ⊆ D
• Iteratively generate programs and samples

[Figure: the Dataset Sampler and the Program Generator alternate: the sampler picks d1 ⊆ D, the generator produces p1; the sampler uses p1 to pick d2 ⊆ D, the generator produces p2; the loop continues until it yields p_best.]
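A minimal skeleton of that loop, with a deliberately trivial generator ("fit a constant") and sampler ("next slice of D") as stand-ins; all names and the toy data are illustrative assumptions:

```python
# Skeleton of the iterative synthesis loop: alternate a dataset sampler and a
# program generator, keeping the best program found on the full dataset.

def generate(d):
    """Stand-in program generator: the program is a constant prediction,
    chosen as the most common label on the sampled data."""
    ys = [y for _, y in d]
    return max(set(ys), key=ys.count)

def cost(data, program):
    """Fraction of examples the (constant) program predicts wrongly."""
    return sum(program != y for _, y in data) / len(data)

def iterative_synthesis(D, rounds=3, k=4):
    programs = []
    for i in range(rounds):
        d = D[i * k:(i + 1) * k]      # stand-in sampler: next slice of D
        programs.append(generate(d))  # synthesize on the small sample only
    return min(programs, key=lambda p: cost(D, p))  # p_best on the full data

# Toy dataset: the label is 0 when x is divisible by 4, else 1.
D = [(x, 0 if x % 4 == 0 else 1) for x in range(20)]
print(iterative_synthesis(D))  # constant program 1, cost 0.25 on D
```

The point of the structure is that each synthesis round only ever touches a small sample, while the final comparison picks the program that generalizes best.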

Page 198:

Representative Dataset Sampler

Idea: pick a small dataset d on which the already generated programs p1 … pn behave as they do on the full dataset D:

    d = argmin_{d ⊆ D} max_{i ∈ 1…n} | cost(d, p_i) − cost(D, p_i) |

(for each program, compare its cost on d with its cost on D)

Page 199:

Representative Dataset Sampler

Idea: pick a small dataset d on which the already generated programs p1 … pn behave as they do on the full dataset D:

    d = argmin_{d ⊆ D} max_{i ∈ 1…n} | cost(d, p_i) − cost(D, p_i) |

Theorem: the sampler shrinks the search space of candidate programs.
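A sketch of the selection rule, by exhaustive search over size-k subsets. The toy programs and data, and the use of error rates (so that costs on d and on D are comparable), are illustrative assumptions:

```python
# Sketch of the representative dataset sampler:
#   d = argmin_{d subset of D, |d|=k} max_i |cost(d, p_i) - cost(D, p_i)|
from itertools import combinations

def cost(data, p):                  # average error of program p on the data
    return sum(p(x) != y for x, y in data) / len(data)

def representative_sample(D, programs, k):
    return min(combinations(D, k),
               key=lambda d: max(abs(cost(d, p) - cost(D, p))
                                 for p in programs))

# Toy dataset: predict y from x; two candidate programs generated so far.
D = [(0, 0), (1, 1), (2, 2), (3, 0), (4, 1)]
p1 = lambda x: x            # wrong on (3, 0) and (4, 1): cost 0.4 on D
p2 = lambda x: x % 3        # right everywhere: cost 0.0 on D

d = representative_sample(D, [p1, p2], k=2)
print(d)  # ((0, 0), (3, 0)): p1 is half right on d, p2 fully right, as on D
```

The chosen pair mixes an example p1 gets right with one it gets wrong, so both programs' costs on d track their costs on D as closely as a two-element sample allows.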

Page 200:

Results

Probabilistic model of the JavaScript language:

• 100k Training
• 20k Learning
• 50k Blind Set

Page 201:

Results: JavaScript APIs

Conditioning program p                      Accuracy
Last two tokens, Hindle et al. [ICSE'12]    22.2%
Last two APIs, Raychev et al. [PLDI'14]     30.4%
Program synthesis                           46.3%
Program synthesis + dataset sampler         50.4%

Page 202:

Results: JavaScript Structure

Model               Accuracy
PCFG                51.1%
N-gram              71.3%
Naïve Bayes         44.2%
SVM                 70.5%
Program synthesis   81.5%

Page 203:

Results: JavaScript Values

Kind          Accuracy   Example
Identifier    62%        contains = jQuery …
Property      65%        start = list.length;
String        52%        '[' + attrs + ']'
Number        64%        canvas(xy[0], xy[1], …)
RegExp        66%        line.replace(/(&nbsp;| )+/, …)
UnaryExpr     97%        if (!events || !…)
BinaryExpr    74%        while (++index < …)
LogicalExpr   92%        frame = frame || …

Page 204:

Tutorial Outline

• Motivation
  • Potential applications
• Statistical Language Models
  • N-gram, Recurrent Networks, PCFGs
  • Application: code completion
• Graphical Models
  • Markov Networks, Conditional Random Fields
  • Inference & Learning in Markov Networks
  • Application: predicting names and types
• Learning Features for Programs
  • Combining Program Synthesis + Machine Learning

Page 205:

Machine Learning for Programs

[Figure: summary of the pipeline: Analyze Program (PL) → Intermediate Representation → Train Model (ML) → Query Model (ML).
Applications: code completion, deobfuscation, program synthesis, feedback generation, translation.
Analyses: alias analysis, typestate analysis, control-flow analysis, scope analysis.
Intermediate representations: sequences (sentences), trees, translation table, feature vectors.
Models: N-gram language model, SVM, Structured SVM, Neural Networks, Graphical Models (CRFs).
Queries: argmax_{y ∈ Ω} P(y | x), greedy MAP inference.]

More information and tutorials at:
http://www.nice2predict.org/
http://www.srl.inf.ethz.ch/

Page 206:

Statistical Programming Tools

Write new code [PLDI'14]: Code Completion

    Camera camera = Camera.open();
    camera.setDisplayOrientation(90);
    ?

Understand code/security [POPL'15]: JavaScript Deobfuscation, Type Prediction (www.jsnice.org)

Debug code: Statistical Bug Detection

    for x in range(a):
        print a[x]      # likely error

Port code [ONWARD'14]: Programming Language Translation

All of these benefit from the probabilistic model for code.

Page 207:

Machine Learning for Programs

Page 208:

Materials

Dagstuhl Seminar on Big Code Analytics, Nov 2015
http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=15472
  – ML, NLP, PL, SE
  – Link to materials, people in the general area

http://learningfrombigcode.org
  – data sets, tools, challenges

http://nice2predict.org
  – fully open source, online demo, build your own tool

Page 209:

Probabilistic Learning from Big Code

Probabilistically likely solutions to problems impossible to solve otherwise.

Team: Martin Vechev, Veselin Raychev, Pavol Bielik, Christine Zeller, Svetoslav Karaivanov, Pascal Roos, Benjamin Bichsel, Andreas Krause, Timon Gehr, Petar Tsankov

Publications:
• PHOG: Probabilistic Model for Code, ICML'16
• Learning Programs from Noisy Data, ACM POPL'16
• Predicting Program Properties from "Big Code", ACM POPL'15
• Code Completion with Statistical Language Models, ACM PLDI'14
• Machine Translation for Programming Languages, ACM Onward!'14

Publicly Available Tools:
• http://JSNice.org – statistical de-obfuscation
• http://Nice2Predict.org – structured prediction framework

More information: http://www.srl.inf.ethz.ch/