TRANSCRIPT
Machine Learning for Programs
Martin Vechev, Pavol Bielik
Department of Computer Science, ETH Zurich
Learning and probabilistic models based on Big Data have revolutionized entire fields:
• Natural Language Processing (e.g., machine translation)
• Computer Vision (e.g., image captioning: "A group of people shopping at an outdoor market. There are many vegetables at the fruit stand.")
• Medical Computing (e.g., disease prediction)
Can we bring this revolution to programmers?
Vision: new kinds of techniques and tools that learn from Big Code.
A statistical programming tool takes a programming task and produces a solution by querying a probabilistic model trained on Big Code: 15 million repositories, billions of lines of code, high-quality, tested, maintained programs.
[Chart: number of GitHub repositories over the last 8 years]
Team: Martin Vechev, Veselin Raychev, Pavol Bielik, Christine Zeller, Svetoslav Karaivanov, Pascal Roos, Benjamin Bichsel, Andreas Krause, Timon Gehr, Petar Tsankov

Goal: probabilistically likely solutions to problems impossible to solve otherwise.

Publications
PHOG: Probabilistic Model for Code, ICML'16
Learning Programs from Noisy Data, ACM POPL'16
Predicting Program Properties from "Big Code", ACM POPL'15
Code Completion with Statistical Language Models, ACM PLDI'14
Machine Translation for Programming Languages, ACM Onward'14

Publicly Available Tools
http://JSNice.org (statistical de-obfuscation)
http://Nice2Predict.org (structured prediction framework)

More information: http://www.srl.inf.ethz.ch/
Probabilistic Learning from Big Code

Tutorial Outline
• Motivation
• Applications
• Statistical Language Models
  • N-gram, Recurrent Networks, PCFGs
  • Application: code completion
• Graphical Models
  • Markov Networks, Conditional Random Fields
  • Inference & Learning in Markov Networks
  • Application: predicting names and types
• Learning Features for Programs
• Combining Program Synthesis + Machine Learning

Timing: 45 min, 15 min break, 60 min, 15 min break, 45 min.
Statistical Programming Tools

Write new code [PLDI'14]: Code Completion

    Camera camera = Camera.open();
    camera.setDisplayOrientation(90);
    ?

Port code [ONWARD'14]: Programming Language Translation

Debug code: Statistical Bug Detection

    ...
    for x in range(a):
        print a[x]        ← likely error

Understand code/security [POPL'15]: JavaScript Deobfuscation, Type Prediction (www.jsnice.org)

JSNice.org: used in every country, a top-ranked tool, 150,000+ users.
[Predicting Program Properties from "Big Code", ACM POPL 2015]

All of these benefit from probabilistic models for code.
(Tutorial outline recap. Next: the general recipe behind these tools.)
Machine Learning for Programs

The general pipeline:

Analyze Program (PL) → Intermediate Representation → Train Model (ML) → Query Model (ML) → Applications

• Analyze Program (PL): alias analysis, typestate analysis, control-flow analysis, scope analysis
• Intermediate Representation: sequences (sentences), trees, translation table, feature vectors, graphical models (CRFs)
• Train Model (ML): N-gram language model, SVM / Structured SVM, neural networks
• Query Model (ML): argmax P(y | x) over y ∈ Ω, via greedy search or MAP inference
• Applications: code completion, deobfuscation, program synthesis, feedback generation, translation
Motivation: Working with APIs

    Camera camera = Camera.open();
    camera.setDisplayOrientation(90);
    camera.unlock();
    MediaRecorder rec = new MediaRecorder();
    rec.setCamera(camera);
    rec.setAudioSource(MediaRecorder.AudioSource.MIC);
    rec.setVideoSource(MediaRecorder.VideoSource.DEFAULT);
    rec.setOutputFormat(MediaRecorder.OutputFormat.MPEG_4);
    rec.setAudioEncoder(1);
    rec.setVideoEncoder(3);
    rec.setOutputFile("file.mp4");
    ...
    ?
    ?
    ?
Statistical Code Synthesis: Capabilities
[Code Completion with Statistical Language Models, V. Raychev, M. Vechev, E. Yahav, PLDI'14]

On examples such as the Camera/MediaRecorder snippet above, the system:
• handles multiple objects,
• infers multiple statements,
• infers sequences not in the training data.
Key Insight

Regularities in code are similar to regularities in natural language. We want to learn that MediaRecorder rec = new MediaRecorder(); comes before rec.setCamera(camera); just as, in natural language, "Hello" comes before "World!".
From Programs to Sentences

    she = new X();
    me = new Y();
    me.sleep();
    if (random()) { me.eat(); }
    she.enter();
    me.talk(she);

Using typestate analysis and alias analysis, we extract one sentence per abstract object and program path:

for abstract object me:   Yinit sleep talk
                          Yinit sleep eat talk
for abstract object she:  Xinit enter talkparam1
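To make the extraction concrete, here is a toy Python sketch of the idea. It uses regular expressions instead of a real alias/typestate analysis, so it produces a single sequence per syntactic variable rather than one sentence per program path; everything in it is illustrative, not the actual PLDI'14 implementation.

    import re
    from collections import defaultdict

    code = """
    she = new X(); me = new Y(); me.sleep();
    if (random()) { me.eat(); }
    she.enter(); me.talk(she);
    """

    # Toy extraction: one event sequence per syntactic object name.
    # A real pipeline runs alias + typestate analysis and emits one
    # sentence per path (e.g., with and without the if-branch).
    events = defaultdict(list)
    for stmt in re.findall(r"[^;{}]+;", code):
        m = re.match(r"\s*(\w+) = new (\w+)\(\)", stmt)
        if m:  # allocation: record an "Xinit"-style event
            events[m.group(1)].append(m.group(2) + "init")
            continue
        m = re.match(r"\s*(\w+)\.(\w+)\((\w*)\)", stmt)
        if m:  # call: event on the receiver, and on the argument if any
            events[m.group(1)].append(m.group(2))
            if m.group(3):
                events[m.group(3)].append(m.group(2) + "param1")

    for obj, sentence in events.items():
        print(obj, ":", " ".join(sentence))
    # me : Yinit sleep eat talk
    # she : Xinit enter talkparam1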
Training phase: programs → Program Analysis → sentences.
Learn Regularities

Learn regularities in the obtained sentences. Regularities in sentences ⇔ regularities in API usage. If we see many sequences like Xinit enter talkparam1, then we should learn that talkparam1 often comes after enter.
Statistical Language Models

Given a sentence s = w1 w2 w3 … wn, estimate P(w1 w2 w3 … wn).

Queries:
P("Hello World!") ≫ P("World! Hello")
argmax_y P("Hello" y)

Decomposed into conditional probabilities (chain rule):
P(w1 w2 w3 … wn) = ∏ i=1..n P(wi | w1 … wi-1)

P(The quick brown fox jumped) = P(The) P(quick | The) P(brown | The quick) P(fox | The quick brown) P(jumped | The quick brown fox)
N-gram Language Model

Condition only on the previous n-1 words:
P(wi | w1 … wi-1) ≈ P(wi | wi-n+1 … wi-1) ≈ #(wi-n+1 … wi-1 wi) / #(wi-n+1 … wi-1)

where #(n-gram) is the number of occurrences of the n-gram in the training data. E.g., with a 3-gram language model:
P(jumped | The quick brown fox) ≈ P(jumped | brown fox) ≈ #(brown fox jumped) / #(brown fox)

Training is achieved by counting n-grams. The time spent per word encountered in training is constant, so training is usually fast.
Tri-gram Language Model

3-gram               # of occurrences
brown fox jumped     125
brown fox walked     45
brown fox snapped    30

P(jumped | brown fox) ≈ 125/200 ≈ 0.625

P(w1 w2 w3 … wn) ≈ P(w1) P(w2 | w1) P(w3 | w1 w2) … P(wn | wn-2 wn-1)

P(brown fox jumped) ≈ P(brown) P(fox | brown) P(jumped | brown fox)
                    ≈ 200/600 · 200/200 · 125/200 ≈ 0.208
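As a minimal sketch (in Python, over a made-up three-sentence corpus), training really is just counting:

    from collections import Counter

    corpus = [
        "the quick brown fox jumped".split(),
        "the quick brown fox walked".split(),
        "the lazy brown fox jumped".split(),
    ]

    n = 3
    ngrams, contexts = Counter(), Counter()
    for sent in corpus:
        for i in range(len(sent) - n + 1):
            ngrams[tuple(sent[i:i + n])] += 1        # count each 3-gram ...
            contexts[tuple(sent[i:i + n - 1])] += 1  # ... and its 2-gram context

    def p(word, *context):
        """MLE estimate #(context word) / #(context)."""
        return ngrams[(*context, word)] / contexts[tuple(context)]

    print(p("jumped", "brown", "fox"))
    # 2/3: 'brown fox' occurs 3 times, 2 of them followed by 'jumped'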
Key Problem: Sparsity of Data

P(jumped | red fox) ≈ #(red fox jumped) / #(red fox)   ... what if this count is 0?

We need to handle n-grams with zero or few occurrences in the training data. The problem of sparsity gets worse as the size of the n-gram grows. Techniques that address it: smoothing, discounting.
Solution: Smoothing Techniques

• Smoothing techniques give non-zero probability to n-grams not in the training data.
• Essentially, they estimate how likely it is that an n-gram is missing due to the limited size of the training data.
• They work by taking probability mass from the existing n-grams and redistributing it over n-grams that occur zero times.
• Typically, the probability of such n-grams is estimated by looking at the probabilities of (n-1)-grams, (n-2)-grams, …, unigrams.
Smoothing: Intuitively

[Figure: distribution of probability mass before smoothing, where all mass sits on sentences in the training data and every other sentence gets probability 0, versus after smoothing, where all other sentences get non-zero probability.]
Smoothing Techniques

Backoff techniques:
Psmooth(wi | wi-n+1 … wi-1) =
    PML(wi | wi-n+1 … wi-1)                          if wi-n+1 … wi occurs in the training data
    γ(wi-n+1 … wi-1) · Psmooth(wi | wi-n+2 … wi-1)   otherwise

Interpolation techniques:
Psmooth(wi | wi-n+1 … wi-1) = λ PML(wi | wi-n+1 … wi-1) + (1 − λ) Psmooth(wi | wi-n+2 … wi-1)

Example smoothing techniques: Witten-Bell Interpolated (WBI), Witten-Bell Backoff (WBB), Natural Discounting (ND), Stupid Backoff (SB), …
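A minimal Python sketch of one of these, Stupid Backoff [Brants et al., 2007]: it assumes counts holds n-gram counts of every order (as built by the counting loop earlier) and total is the corpus word count. Strictly speaking it returns scores rather than normalized probabilities, which is why it is cheap.

    def stupid_backoff(word, context, counts, total, alpha=0.4):
        """Score s(word | context); backs off to shorter contexts."""
        score = 1.0
        while context:
            full = tuple(context) + (word,)
            if counts.get(full, 0) > 0:
                return score * counts[full] / counts[tuple(context)]
            score *= alpha         # fixed penalty for each backoff step
            context = context[1:]  # drop the oldest context word
        return score * counts.get((word,), 0) / total  # unigram fallback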
Summary: N-gram Language Model

Condition only on the previous n-1 words:
P(wi | w1 … wi-1) ≈ P(wi | wi-n+1 … wi-1) ≈ #(wi-n+1 … wi-1 wi) / #(wi-n+1 … wi-1)

+ Easy to implement
+ Fast to train (simply counting)
- Data sparsity issues
- Can be tricky to capture long-distance relationships

An existing library implementing language models is SRILM: http://www.speech.sri.com/projects/srilm/
Training phase: programs → Program Analysis → sentences → Train Language Model (N-gram) → language model.
Recurrent Neural Networks (RNNs)

An RNN language model reads one word at a time. The input word (e.g., "The", then "quick") is encoded as a one-hot vector (0 0 1 0 0 …); a hidden layer carries the state from step i-1 to step i; the output layer, a softmax classifier, gives the probability of each possible next word (probability of "The", probability of "brown", …).

Three weight matrices connect the layers: U (input to hidden), W (hidden to hidden, carrying the previous state), and V (hidden to output). U, V and W are learned through backpropagation through time, which unfolds the network over the sequence.

Matrix U learns a low-dimensional continuous vector representation of input words, also referred to as a "word embedding".

Word Embeddings
[Linguistic Regularities in Continuous Space Word Representations, Mikolov et al., NAACL-HLT 2013]
Summary: RNNs

+ Can learn dependencies beyond the prior several words (LSTM networks)
+ Active research area
+ Learns a continuous representation of the input vocabulary
- Slow to train
- Difficult to train
- Not interpretable

Download: RNNLM (http://rnnlm.org/), TensorFlow (https://www.tensorflow.org/)
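For concreteness, one forward step of such an RNN language model can be sketched in a few lines of NumPy. Sizes and weights are toy values (untrained), and training via backpropagation through time is not shown:

    import numpy as np

    VOCAB, HIDDEN = 5000, 128                  # toy sizes
    rng = np.random.default_rng(0)
    U = rng.normal(0, 0.01, (HIDDEN, VOCAB))   # input -> hidden (embeddings)
    W = rng.normal(0, 0.01, (HIDDEN, HIDDEN))  # hidden -> hidden (recurrence)
    V = rng.normal(0, 0.01, (VOCAB, HIDDEN))   # hidden -> output

    def step(word_id, state):
        """Consume one word, return P(next word) and the new state."""
        x = np.zeros(VOCAB)
        x[word_id] = 1.0                       # one-hot input word
        state = np.tanh(U @ x + W @ state)     # new hidden state
        logits = V @ state
        probs = np.exp(logits - logits.max())  # softmax classifier
        return probs / probs.sum(), state

    probs, state = step(42, np.zeros(HIDDEN))  # distribution over next word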
Training phase: programs → Program Analysis → sentences → Train Language Model (N-gram or RNN) → language model.
The SLANG System

Completion phase: the incomplete program (the Camera/MediaRecorder snippet with ? holes) is analyzed into sentences with holes; the trained language model completes the sentences; combining the completions yields the completed program:

    Camera camera = Camera.open();
    camera.setDisplayOrientation(90);
    camera.unlock();
    MediaRecorder rec = new MediaRecorder();
    rec.setCamera(camera);
    rec.setAudioSource(MediaRecorder.AudioSource.MIC);
    rec.setVideoSource(MediaRecorder.VideoSource.DEFAULT);
    rec.setOutputFormat(MediaRecorder.OutputFormat.MPEG_4);
    rec.setAudioEncoder(1);
    rec.setVideoEncoder(3);
    rec.setOutputFile("file.mp4");
    ...
Statistical Code Synthesis

    smsMgr = SmsManager.getDefault();
    int length = message.length();
    if (length > MAX_SMS_MESSAGE_LENGTH) {
      list = smsMgr.divideMessage(message);
      ? {smsMgr, list}      // (Hole H1)
    } else {
      ? {smsMgr, message}   // (Hole H2)
    }

Sentences per abstract object:
smsMgr:   getDefaultresult divideMessage H1
          getDefaultresult H2
list:     divideMessageresult H1
message:  length divideMessageparam1
          length H2
For each hole, query the model, e.g., argmax_H1 P(getDefaultresult divideMessage H1). Candidate completions and their probabilities:

smsMgr:   getDefaultresult divideMessage sendMultipartTextMessage   0.0033
          getDefaultresult divideMessage sendTextMessage            0.0016
          getDefaultresult sendTextMessage                          0.0073
          getDefaultresult sendMultipartTextMessage                 0.0010
list:     divideMessageresult sendMultipartTextMessageparam3        0.0821
message:  length length                                             0.0132
          length split                                              0.0080
          length sendTextMessageparam3                              0.0017

Not every combination is a feasible solution: the completions for a hole can disagree on the selected method, and the solution must satisfy the program constraints. The highest-scoring consistent completion:

    smsMgr = SmsManager.getDefault();
    int length = message.length();
    if (length > MAX_SMS_MESSAGE_LENGTH) {
      list = smsMgr.divideMessage(message);
      smsMgr.sendMultipartTextMessage(...list...);
    } else {
      smsMgr.sendTextMessage(...message...);
    }
The SLANG System

Training phase: 1M Java methods (~700MB) → Program Analysis → sentences (~100MB) → Train Language Model → language model.
Completion phase: incomplete program → Program Analysis → sentences with holes → Query model → completed sentences → Combine → completed program.

Evaluation: on 84 testing samples, the correct completion is in the top 3 for ~90% of the cases.
Lessons

• Connection to PL: decide what to "compile" into the words; the program analysis need not be sound.
• N-gram models: easy and cheap to train; important to deal with sparsity via smoothing; can be tricky to capture long-distance relationships.
• Recurrent networks: can capture longer-distance relationships; more expensive to train; experimentally similar to n-gram language models.
Importance of Static Analysis

[Chart: precision (0 to 1) vs. % of data used (1%, 10%, 100%), with and without alias analysis. Static analysis benefit ≈ 10x more data.]
Models of Full Language

So far: n-gram models and RNNs, modelling specific elements (APIs). How can we build statistical language models of the full programming language?
(Tutorial outline recap. Next: Graphical Models.)
Applications

Natural Language Processing: parsing, e.g., "the dog saw a man in the park" [parse tree: S, NP VP, Det N V NP PP, …]
Computer Vision
Programming:

    function chunkData(e, t) {
      var n = [];
      var r = e.length;
      var i = 0;
      for (; i < r; i += t)
        …
      return n;
    }
Challenges

• Facts to be predicted are dependent
• Millions of candidate choices
• Must quickly learn from huge codebases
• Prediction should be fast (real time)
Key Idea

Phrase the problem of predicting program facts as structured prediction.
Structured Prediction for Programs [V. Raychev, M. Vechev, A. Krause, ACM POPL'15]

Program Analysis → Graphical Models (knowledge) → Inference & Learning (reasoning). This was the first connection between programs and Conditional Random Fields.
In the general pipeline, the intermediate representations now include graphical models (CRFs), and the trained models include log-linear models / Markov networks. More information and tutorials at http://www.nice2predict.org/ and http://www.srl.inf.ethz.ch/.
Defining a Global Probabilistic Model

Joint probability distribution P(A, B, C, D), as a table of joint probabilities:

A B C D
0 0 0 0   P(a0, b0, c0, d0) = 0.04
0 0 0 1   P(a0, b0, c0, d1) = 0.04
0 0 1 0   P(a0, b0, c1, d0) = 0.04
0 0 1 1   P(a0, b0, c1, d1) = 4.1 · 10^-6
… … … …  …

n variables → O(2^n) table entries! Instead, define the global model by combining a set of local interactions between variables:

P(A, B, C, D) ∝ F(A, B, C, D)
F(A, B, C, D) = φ1(A, B) · φ2(B, C) · φ3(C, D) · φ4(D, A)
Factors

• A factor (or potential) is a function φ from a set of random variables D to a real number: φ: D → ℝ
• The set of variables D is the scope of the factor; we are typically concerned with non-negative factors.
• Intuition: a factor captures affinity, agreement, or compatibility of the variables in D.
Factors: Example

φ1(A,B): (0,0)→30   (0,1)→5    (1,0)→1    (1,1)→10
φ2(B,C): (0,0)→100  (0,1)→1    (1,0)→1    (1,1)→100
φ3(C,D): (0,0)→1    (0,1)→100  (1,0)→100  (1,1)→1
φ4(D,A): (0,0)→100  (0,1)→1    (1,0)→1    (1,1)→100

• φ1(0,0) = 30 says we believe A and B agree on 0 with belief 30
• φ3(C,D) says that C and D "argue all the time" (they prefer to disagree)

Key point: (local) factors have no probabilistic interpretation.
Defining a Global Probabilistic Model

P(A, B, C, D) = F(A, B, C, D) / Z
F(A, B, C, D) = φ1(A, B) · φ2(B, C) · φ3(C, D) · φ4(D, A)
Z = Σ over a ∈ A, b ∈ B, c ∈ C, d ∈ D of F(a, b, c, d)    (the partition function)
P(A, B, C, D)

A B C D   Unnormalized (F)   Normalized (P = F/Z)
0 0 0 0   300,000            0.04
0 0 0 1   300,000            0.04
0 0 1 0   300,000            0.04
0 0 1 1   30                 4.1 · 10^-6
… … … …   …                  …

Z = 7,201,840

We can now answer probability queries on P(A, B, C, D). For example:
P(A = 0, B = 0) = Σ over c ∈ C, d ∈ D of P(0, 0, c, d) ≈ 0.125
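Because the example is tiny, the whole computation can be checked by brute force in Python; enumerating all 16 assignments reproduces the Z and the query above:

    from itertools import product

    phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}     # phi1(A,B)
    phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}   # phi2(B,C)
    phi3 = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}   # phi3(C,D)
    phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}   # phi4(D,A)

    def F(a, b, c, d):  # unnormalized measure
        return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]

    Z = sum(F(*v) for v in product([0, 1], repeat=4))
    print(Z)                  # 7201840
    print(F(0, 0, 0, 0) / Z)  # ~0.04
    print(sum(F(0, 0, c, d) for c in (0, 1) for d in (0, 1)) / Z)  # ~0.125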
Markov Network

Let X = {X1, …, Xn} be a set of random variables and D1, …, Dm ⊆ X. Then a Markov Network is defined as:

P(X1, X2, …, Xn) = φ1(D1) · φ2(D2) ⋯ φm(Dm) / Z
Z = Σ over all assignments to X1, …, Xn of φ1(D1) · φ2(D2) ⋯ φm(Dm)
Graphs vs. Probability Distributions

We next relate graphs and probability distributions. This is important for two reasons. First, it allows us to determine properties (e.g., independence of variables) of a probability distribution directly from the graph. Second, it allows us to answer queries (e.g., MAP inference) on the probability distribution by working with the graph.

P(A, B, C, D) = φ1(A, B) · φ2(B, D) · φ3(D, C) · φ4(C, A) / Z

Which variables are independent? Intuitively, variables are nodes and factors introduce edges: here the graph is the 4-cycle with edges A-B, B-D, D-C, C-A.
Graph Factorization

A distribution P equipped with a set of factors φ1(D1), …, φm(Dm) factorizes over a graph G if each Di is a complete subgraph of G.

• P(A, B, C, D) = φ1(A, B) φ2(B, D) φ3(D, C) φ4(C, A) / Z factorizes over the 4-cycle graph: every pairwise scope is an edge.
• P(A, B, C, D) = φ1(A, B, C) φ2(B, D, C) / Z does not factorize over the 4-cycle: {A, B, C} is not a complete subgraph.
• Over the complete graph on A, B, C, D, both P = φ1(A, B) φ2(A, C) φ3(A, D) φ4(B, C) φ5(B, D) φ6(C, D) / Z and P = φ1(A, B, C, D) / Z factorize.
Graphs vs. Probabilities

There is an important connection between a probability distribution that factorizes over a graph and the properties of the graph. In particular, we can discover various independence properties of the distribution directly from the graph.
Graph Separation

Two disjoint sets of nodes A and B in a graph G are separated by a set S if every path between A and B goes through a node in S. (A and B are also separated if no path between them exists, regardless of S.)

The Global Markov Property says that if A and B are separated by S, then A ⊥ B | S, that is, P(A, B | S) = P(A | S) · P(B | S).
Graph Separation: Example

[Graph over nodes A, B, C, D, E, F, G]

Examples of global independencies found in the graph:
{A} ⊥ {G} | {D}, i.e., P(A, G | D) = P(A | D) · P(G | D)
{B} ⊥ {F} | {E, G}
Graphical Models: So far

We have seen what a Markov Network is, what it means for a probability distribution to factorize over a graph, and how to extract independence information from the graph. So far we have been discussing the joint distribution P(X1, …, Xn); we next focus on conditional distributions.
So far: generative models, which model the full joint distribution.

Next: discriminative models, which predict unknown facts given some known facts and do not model a distribution over the known facts.
Conditional Random Field [J. Lafferty, A. McCallum, F. Pereira, ICML 2001]

Let X ∪ Y be a set of random variables, where X = {X1, …, Xn} are the observed variables and Y = {Y1, …, Yn} are the target variables. Let D1, D2, …, Dm ⊆ X ∪ Y, where each Di ⊈ X (every factor touches at least one target variable). A Conditional Random Field is defined as:

P(Y1, …, Yn | X1, …, Xn) = φ1(D1) · φ2(D2) ⋯ φm(Dm) / Z(X1, …, Xn)
Z(X1, …, Xn) = Σ over Y of φ1(D1) · φ2(D2) ⋯ φm(Dm)

Note that Z is a function of the variables in X only (the sum ranges over Y).

Key advantage: a CRF avoids encoding a distribution over the variables X. Thus, we can use many observed variables with complex dependencies without needing to model any joint distribution over them.
Conditional Random Field: Example — Text Analytics (Named Entity Recognition)

Problem statement: given a sentence, label each word with whether it is a person or a location.
5 labels: B-per, I-per, B-loc, I-loc, O

Given:    Mr.    Smith  came  today  to  Toronto
Predict:  B-per  I-per  O     O      O   B-loc

The words are observed variables X1 … X6; the labels are target variables Y1 … Y6.

Two kinds of factors for each word Xt:
φ1_t(Yt, Yt+1)          captures the relationship between neighboring predictions
φ2_t(Yt, X1, …, Xt)     the relationship between a prediction and the word and its neighbors

P(Y1, …, Y6 | X1, …, X6) = φ1_1(Y1, Y2) ⋯ φ1_5(Y5, Y6) · φ2_1(Y1, X1, X2) ⋯ φ2_6(Y6, X5, X6) / Z(X1, …, X6)

where Z(X1, …, X6) sums the same product over all label assignments Y1, …, Y6.
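For linear chains like this one, both Z(X) and the most likely labeling can be computed exactly by dynamic programming. Below is a NumPy sketch with random stand-in factor scores; a real NER model would compute the log φ2 scores from word features and learn all weights:

    import numpy as np

    LABELS = ["B-per", "I-per", "B-loc", "I-loc", "O"]
    WORDS = "Mr. Smith came today to Toronto".split()
    L, T = len(LABELS), len(WORDS)

    rng = np.random.default_rng(0)
    log_phi1 = rng.normal(size=(L, L))  # log phi1(y_t, y_{t+1}), shared over t
    log_phi2 = rng.normal(size=(T, L))  # log phi2_t(y_t, x), toy scores

    def logsumexp(a, axis=0):
        m = a.max(axis=axis)
        return m + np.log(np.exp(a - np.expand_dims(m, axis)).sum(axis=axis))

    # Forward algorithm: log Z(x) in O(T * L^2) instead of O(L^T)
    alpha = log_phi2[0]
    for t in range(1, T):
        alpha = log_phi2[t] + logsumexp(alpha[:, None] + log_phi1, axis=0)
    log_Z = logsumexp(alpha)

    # Viterbi: the most likely label sequence under the same factors
    delta, back = log_phi2[0], []
    for t in range(1, T):
        scores = delta[:, None] + log_phi1
        back.append(scores.argmax(axis=0))
        delta = log_phi2[t] + scores.max(axis=0)
    y = [int(delta.argmax())]
    for b in reversed(back):
        y.append(int(b[y[-1]]))
    print(list(zip(WORDS, (LABELS[i] for i in reversed(y)))))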
Learning from Data

We next look at an equivalent representation of Markov Networks and Conditional Random Fields that is amenable to learning from data: the log-linear form.
Log-Linear Models

Another representation of a Markov Network is a log-linear model. Here we have a set of feature functions f1(D1), …, fk(Dk), where each Di is a complete subgraph of G. A feature fi(Di) can return any value, i.e., it can be negative.

P(X1, …, Xn) = exp(−w1 f1(D1)) ⋯ exp(−wk fk(Dk)) / Z(X1, …, Xn)
Z(X1, …, Xn) = Σ over all assignments of exp(−Σ i=1..k wi fi(Di))

Any Markov Network whose factors are positive can be converted to a log-linear model.

Log-Linear Representation: Benefits

Log-linear models have a few benefits over factors. They:
1. Make certain relationships more explicit.
2. Offer a much more compact representation for many distributions, especially for variables with large domains (e.g., names in JSNice).
3. Are useful for learning: we are given the feature functions, and the learning phase simply figures out the weights.
Feature Function: Example

Suppose the factor φ1 is:
φ1(A, B) = 100 if A = B, and 1 otherwise.

If A and B range over m and n values respectively, we must store m · n values to keep this factor. Instead, we can use a single (indicator) feature function:
f1(A, B) = 1 if A = B, and 0 otherwise, with weight w1 = −4.61 (since exp(−w1) ≈ 100).

Indicator functions can be directly extracted from data during learning (later).
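A quick sanity check of the numbers in Python (the weight −4.61 is just −ln 100, so the log-linear form reproduces the original factor up to rounding):

    import math

    w1 = -4.61                            # weight from the slide, ~ -ln(100)
    f1 = lambda a, b: 1 if a == b else 0  # indicator feature

    phi = lambda a, b: math.exp(-w1 * f1(a, b))
    print(phi(0, 0), phi(0, 1))           # ~100.5 and 1.0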
Graphical Models: So far

In this segment we learned about log-linear models and indicator functions. Log-linear models are important because they capture factors more concisely, the indicator functions can be extracted directly from data, and the weights can be learned to fit the optimization objective (discussed later).
Hands-on
Structured Prediction for Programs

    function chunkData(e, t) {
      var n = [];
      var r = e.length;
      var i = 0;
      for (; i < r; i += t)
        if (i + t < r) n.push(e.substring(i, i + t));
        else n.push(e.substring(i, r));
      return n;
    }

Unknown facts: the names of i, t, r, … Known facts: e.g., the property name length. The facts form a network; pairwise feature functions over connected facts carry learned weights w:

edge (i, t):       i, step → 0.5        j, step → 0.4
edge (i, r):       i, len → 0.6         i, length → 0.3
edge (r, length):  length, length → 0.5   len, length → 0.3

Prediction picks the joint assignment argmax w^T f(i, t, r, length), here i, step, len, yielding:

    function chunkData(str, step) {
      var colNames = [];
      var len = str.length;
      var i = 0;
      for (; i < len; i += step)
        if (i + step < len) colNames.push(str.substring(i, i + step));
        else colNames.push(str.substring(i, len));
      return colNames;
    }
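On an example this small, the MAP query can be solved by brute force. The Python sketch below scores every joint assignment using the pairwise weights from the table (pairs not listed score 0; the candidate sets are made up for illustration). The real system cannot enumerate like this, since its graphs have millions of candidate names, which is why it uses greedy/approximate MAP inference:

    from itertools import product

    # pairwise feature weights w from the table above; unseen pairs score 0
    w = {
        ("i", "t"): {("i", "step"): 0.5, ("j", "step"): 0.4},
        ("i", "r"): {("i", "len"): 0.6, ("i", "length"): 0.3},
        ("r", "length"): {("length", "length"): 0.5, ("len", "length"): 0.3},
    }
    candidates = {"i": ["i", "j"], "t": ["step"], "r": ["len", "length"]}
    known = {"length": "length"}            # observed fact

    def score(assignment):                  # w^T f for one joint assignment
        a = {**assignment, **known}
        return sum(tbl.get((a[u], a[v]), 0.0) for (u, v), tbl in w.items())

    nodes = list(candidates)
    best = max((dict(zip(nodes, vals))
                for vals in product(*candidates.values())), key=score)
    print(best, score(best))  # {'i': 'i', 't': 'step', 'r': 'len'} 1.4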
(Tutorial outline recap. Next: Inference & Learning in Markov Networks.)
Queries: MAP vs. Max Marginals

MAP inference:   (y1, …, yn)^MAP = argmax over y1, …, yn of P(y1, …, yn)
Max marginals:   (y1, …, yn)^ML = (y1^ML, …, yn^ML), where each yi^ML = argmax over yi of P(yi)

Consider the following probability distribution:

A B   val
0 0   0.25
0 1   0.30
1 0   0.10
1 1   0.35

argmax P(A, B) over (A, B) gives (1, 1) (MAP), but maximizing the marginals P(A) and P(B) independently gives (0, 1) (ML).

Empirical comparison:
Baseline         25.3%
Max Marginals    54.1%
MAP Inference    63.4%
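The two queries are easy to compare on the toy table with a short NumPy check:

    import numpy as np

    # joint distribution P(A, B) from the table above
    P = np.array([[0.25, 0.30],
                  [0.10, 0.35]])

    print(np.unravel_index(P.argmax(), P.shape))           # MAP: (1, 1)
    print(P.sum(axis=1).argmax(), P.sum(axis=0).argmax())  # max marginals: 0 1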
MAP Inference

Bad news: computing the argmax is in general NP-hard (Max-SAT).
Good news: many approximate methods exist:
• Gibbs sampling
• Expectation Maximization
• Variational methods
• Elimination algorithm
• Junction-tree algorithm
• Loopy belief propagation

Bad news: these are still too slow for learning, at O(N · M^N), where N is the number of neighbors and M the number of states. We designed an approximate algorithm to fit our needs, i.e., to deal with many values.
Structured SVM Training [N. Ratliff, J. Bagnell, M. Zinkevich, AISTATS 2007]

Given a data set D = {x(j), y(j)} for j = 1..n, learn the weights w of
P(y | x) = (1/Z(x)) exp(w^T f(y, x))

Optimization objective (max-margin training): for all samples j and all y,
Σi wi fi(x(j), y(j)) ≥ Σi wi fi(x(j), y) + Δ(y, y(j))

i.e., the given prediction must be better than any other prediction by a margin. This avoids the expensive computation of the partition function Z(x).
Learning Phase: programs → program analysis (alias, call analysis) and transformation → SSVM learning (max-margin training). 7M feature functions for names, 70K feature functions for types; model size 150MB.

Prediction Phase: program → program analysis → Conditional Random Field (typically ~30 nodes, 400 edges) → MAP inference on P(y | x) in milliseconds; e.g., renaming var n = []; var r = e.length; … to var colNames = []; var len = str.length; …

Accuracy: names 63%, types 81% (helps typechecking).

Structured Prediction for Programs (V. Raychev, M. Vechev, A. Krause, ACM POPL'15)
(Tutorial outline recap. Next: Learning Features for Programs; Combining Program Synthesis + Machine Learning.)
Models of Full Language

So far: n-gram models and RNNs, modelling specific elements (APIs). How can we build statistical language models of the full programming language?
Program Representation: Sequences

    elem.notify({
      position: 'top',
      autoHide: false,
      delay: 100
    });

As a token sequence:
elem, ., notify, (, {, position, :, 'top', ',', autoHide, :, false, ',', delay, :, 100, }, ), ;

A model learned over such sequences is then used to predict held-out tokens:
?, ?, ?, (, {, ?, :, ?, ',', ?, :, ?, ',', ?, :, ?, }, ?, ;
Statistical Code Synthesis: Existing Approaches

Two families of existing approaches, both of the form x ← argmax P(x | conditioning context):
• Syntactic: condition on the surrounding program text; a bad fit for programs.
• Semantic: condition on hard-coded features (e.g., promise, defer, reject); task- and language-specific heuristics.
[Hindle et al., 2012] [Nguyen et al., 2013] [Allamanis et al., 2014] [Raychev et al., 2014] [Allamanis et al., 2015]
Program Representation: Trees

    elem.notify({
      position: 'top',
      autoHide: false,
      delay: 100
    });

corresponds to the abstract syntax tree:

CallExpression
  MemberExpression
    Identifier elem
    Property notify
  ObjectExpression
    Property position (String 'top')
    Property autoHide (Boolean false)
    Property delay (100)
Probabilistic Context-Free Grammars

• N is a finite set of non-terminal symbols
• Σ is a finite set of terminal symbols
• R is a finite set of production rules α → β1…βn, with α ∈ N and βi ∈ (N ∪ Σ)
• S ∈ N is a start symbol
• q: R → [0, 1] is the conditional probability of choosing a given production rule
For the AST above, candidate production rules include:

CallExpr → MemberExpr              [ elem.notify() ]
CallExpr → MemberExpr Identifier   [ elem.notify(x) ]
CallExpr → Identifier              [ notify() ]
CallExpr → MemberExpr ObjectExpr   [ elem.notify({…}) ]
Training estimates rule probabilities by counting:
q(α → β1…βn) ≈ #(α → β1…βn) / #(α)

This yields a valid probability distribution: for every α, the sum of q(α → β1…βn) over all rules α → β1…βn ∈ R equals 1.
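The training step is a direct generalization of n-gram counting. A small Python sketch, over a handful of made-up rule observations:

    from collections import Counter

    # (head nonterminal, children) pairs observed in parsed training data
    observed = [
        ("CallExpr", ("MemberExpr",)),
        ("CallExpr", ("MemberExpr", "ObjectExpr")),
        ("CallExpr", ("MemberExpr", "ObjectExpr")),
        ("CallExpr", ("MemberExpr", "Identifier")),
    ]

    rule_counts = Counter(observed)
    head_counts = Counter(head for head, _ in observed)

    def q(head, children):
        """q(head -> children) = #(rule) / #(head)."""
        return rule_counts[(head, tuple(children))] / head_counts[head]

    print(q("CallExpr", ("MemberExpr", "ObjectExpr")))  # 0.5
    # for each head, the rule probabilities sum to 1 by construction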
Probabilistic Context-Free Grammars: Limitations

• Poor context for prediction (e.g., only the parent non-terminal)
• Non-trivial to obtain good production rules
[Liang et al., ICML'10] [Allamanis & Sutton, FSE'14] [Maddison & Tarlow, ICML'14] [Gvero & Kuncak, OOPSLA'15]
Lessons

• Probabilistic grammars:
  + Easy to implement and cheap to train
  + Defined over a generic AST representation of programs
  - Important to deal with sparsity via smoothing
  - Non-trivial to define good production rules
  - Production rules selected based on limited context (e.g., the parent non-terminal symbol)

PCFGs are still poor at modelling a full programming language, but they are a step towards the solution.
Our Goal

A probabilistic model, learned from existing programs, that offers high precision, efficient learning, wide applicability, and explainable predictions.
Key Idea

P(wi | context)

So far, the context has been defined manually for each model (n-grams, RNNs, CRFs, PCFGs). Goal: use program synthesis and machine learning techniques to learn the best context for a given query wi.
PHOG: Concepts

Use a function to build a probabilistic model that generalizes PCFGs by allowing conditioning on richer context. Program synthesis learns a function that explains the data; the function returns a conditioning context for a given query.
PHOG: Generalizes PCFG

Context-Free Grammar: α → β1…βn
  Property → x          0.05
  Property → y          0.03
  Property → promise    0.001

Higher Order Grammar: α[γ] → β1…βn
  Property[reject, promise] → promise   0.67
  Property[reject, promise] → notify    0.12
  Property[reject, promise] → resolve   0.11
Conditioning on Richer Context: α[γ] → β1…βn

What is the best conditioning context γ? Candidates include identifiers, constants, APIs, fields, control structures, … We need a function mapping the source code to a conditioning context γ.
Higher Order Grammar

Parametrize the grammar by a function used to dynamically obtain the context:

Production rules R: α[γ] → β1…βn
Function: f: AST → γ

Applying f to the abstract syntax tree of the source code yields the conditioning context γ.
Function Representation

In general, f could be an unrestricted (Turing-complete) program. In our work, f is expressed in TCond, a language for navigating over trees and accumulating context:

TCond   ::= ε | WriteOp TCond | MoveOp TCond
MoveOp  ::= Up | Left | Right | DownFirst | DownLast | NextDFS | PrevDFS | NextLeaf | PrevLeaf | PrevNodeType | PrevNodeValue | PrevNodeContext
WriteOp ::= WriteValue | WriteType | WritePos

Move operations (Up, Left, …) walk over the AST; write operations append to the accumulated context: γ ← γ · value.
Expressing Functions: TCond Example

    elem.notify(
      ... ,
      ... ,
      {
        position: 'top',
        hide: false,
        ?          ← query
      }
    );

Executing the TCond program Left WriteValue Up WritePos Up DownFirst DownLast WriteValue from the query node:

Op           γ
Left         {}
WriteValue   {hide}
Up           {hide}
WritePos     {hide, 3}
Up           {hide, 3}
DownFirst    {hide, 3}
DownLast     {hide, 3}
WriteValue   {hide, 3, notify}

The accumulated context is { previous property, parameter position, API name }.
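The walk above is mechanical enough to script. Below is a toy Python interpreter for the four TCond operations used in this example, over a hand-built miniature AST. The node kinds and the exact WritePos convention are simplifying assumptions: in this encoding the object literal is child 3 of the call node, which is where the "3" comes from.

    class Node:
        def __init__(self, kind, value=None, children=()):
            self.kind, self.value, self.parent = kind, value, None
            self.children = list(children)
            for c in self.children:
                c.parent = self

    def run_tcond(ops, start):
        cur, ctx = start, []
        for op in ops:
            sibs = cur.parent.children if cur.parent else []
            if op == "Up":           cur = cur.parent
            elif op == "Left":       cur = sibs[sibs.index(cur) - 1]
            elif op == "DownFirst":  cur = cur.children[0]
            elif op == "DownLast":   cur = cur.children[-1]
            elif op == "WriteValue": ctx.append(cur.value)
            elif op == "WritePos":   ctx.append(sibs.index(cur))
        return ctx

    # elem.notify(..., ..., { position: 'top', hide: false, ? })
    hole = Node("Property", "?")
    obj = Node("ObjectExpression", children=[
        Node("Property", "position"), Node("Property", "hide"), hole])
    callee = Node("MemberExpression", children=[
        Node("Identifier", "elem"), Node("Property", "notify")])
    call = Node("CallExpression", children=[
        callee, Node("Arg", "..."), Node("Arg", "..."), obj])

    print(run_tcond(["Left", "WriteValue", "Up", "WritePos",
                     "Up", "DownFirst", "DownLast", "WriteValue"], hole))
    # ['hide', 3, 'notify']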
Learning PHOG

Over the existing dataset D and the TCond language, synthesize
f_best = argmin over f ∈ TCond of cost(D, f)
Synthesis: Practically Intractable

With millions (≈ 10^8) of input/output examples, computing cost(D, f) is O(|D|) per candidate f, making the synthesis of f_best practically intractable. Key idea: iterative synthesis on a fraction of the examples.
Solution: Two Components

A program generator and a representative dataset sampler: the sampler picks a dataset d ⊆ D, and the generator solves f_best = argmin over f ∈ TCond of cost(d, f).

In a loop: start with a small sample d1 ⊆ D; the generator produces a candidate program p1; the sampler picks a new dataset d2 ⊆ D; the generator produces p2; iterate until p_best.
Representative Dataset Sampler

Idea: pick a small dataset d on which the already generated programs p1…pn behave as they do on the full dataset D:

d = argmin over d ⊆ D of max over i ∈ 1…n of |cost(d, pi) − cost(D, pi)|

Theorem: the sampler shrinks the search space of candidate programs.
Results

A probabilistic model of the JavaScript language: 100k training set, 20k learning set, 50k blind set.
Results: JavaScript APIs

Conditioning program p                        Accuracy
Last two tokens, Hindle et al. [ICSE'12]      22.2%
Last two APIs, Raychev et al. [PLDI'14]       30.4%
Program synthesis                             46.3%
Program synthesis + dataset sampler           50.4%
Results: JavaScript Structure

Model                Accuracy
PCFG                 51.1%
N-gram               71.3%
Naïve Bayes          44.2%
SVM                  70.5%
Program synthesis    81.5%
Results: JavaScript Values

Node type      Accuracy   Example
Identifier     62%        contains = jQuery …
Property       65%        start = list.length;
String         52%        '[' + attrs + ']'
Number         64%        canvas(xy[0], xy[1], …)
RegExp         66%        line.replace(/( | )+/, …)
UnaryExpr      97%        if (!events || !…)
BinaryExpr     74%        while (++index < …)
LogicalExpr    92%        frame = frame || …
(Tutorial outline recap: all sections covered.)
Machine Learning for Programs: recap of the full pipeline, Analyze Program (PL) → Intermediate Representation → Train Model (ML) → Query Model (ML) → Applications. More information and tutorials at http://www.nice2predict.org/ and http://www.srl.inf.ethz.ch/.
Statistical Programming Tools: code completion [PLDI'14], programming language translation [ONWARD'14], statistical bug detection, JavaScript deobfuscation and type prediction [POPL'15] (www.jsnice.org). All of these benefit from the probabilistic model for code.
Materials

Dagstuhl Seminar on Big Code Analytics, Nov 2015: http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=15472 (ML, NLP, PL, SE; links to materials and people in the general area)
http://learningfrombigcode.org (data sets, tools, challenges)
http://nice2predict.org (fully open source, online demo, build your own tool)