TRANSCRIPT
Machine Learning for Programs
Martin Vechev, Pavol Bielik
Department of Computer Science, ETH Zurich
Learning and probabilistic models based on Big Data have revolutionized entire fields:
• Natural Language Processing (e.g., machine translation)
• Computer Vision (e.g., image captioning: "A group of people shopping at an outdoor market. There are many vegetables at the fruit stand.")
• Medical Computing (e.g., disease prediction)
Can we bring this revolution to programmers?
Vision: new kinds of techniques and tools that learn from Big Code.
A statistical programming tool takes a programming task and produces a solution by querying a probabilistic model trained on Big Code: 15 million repositories, billions of lines of code, high-quality, tested, maintained programs.
[Chart: number of GitHub repositories over the last 8 years]
Team: Martin Vechev, Veselin Raychev, Pavol Bielik, Christine Zeller, Svetoslav Karaivanov, Pascal Roos, Benjamin Bichsel, Andreas Krause, Timon Gehr, Petar Tsankov

Goal: probabilistically likely solutions to problems impossible to solve otherwise.

Publications
PHOG: Probabilistic Model for Code, ICML'16
Learning Programs from Noisy Data, ACM POPL'16
Predicting Program Properties from "Big Code", ACM POPL'15
Code Completion with Statistical Language Models, ACM PLDI'14
Machine Translation for Programming Languages, ACM Onward'14

Publicly Available Tools
http://JSNice.org (statistical de-obfuscation)
http://Nice2Predict.org (structured prediction framework)

More information: http://www.srl.inf.ethz.ch/
Probabilistic Learning from Big Code

Tutorial Outline
• Motivation
• Applications
• Statistical Language Models
  • N-gram, Recurrent Networks, PCFGs
  • Application: code completion
• Graphical Models
  • Markov Networks, Conditional Random Fields
  • Inference & Learning in Markov Networks
  • Application: predicting names and types
• Learning Features for Programs
• Combining Program Synthesis + Machine Learning

Timing: 45 min, 15 min break, 60 min, 15 min break, 45 min.
Statistical Programming Tools

Write new code [PLDI'14]: Code Completion

    Camera camera = Camera.open();
    camera.setDisplayOrientation(90);
    ?

Port code [ONWARD'14]: Programming Language Translation

Debug code: Statistical Bug Detection

    ...
    for x in range(a):
        print a[x]        ← likely error

Understand code/security [POPL'15]: JavaScript Deobfuscation, Type Prediction (www.jsnice.org)

JSNice.org: used in every country, a top-ranked tool, 150,000+ users.
[Predicting Program Properties from "Big Code", ACM POPL 2015]

All of these benefit from probabilistic models for code.
(Tutorial outline recap. Next: the general recipe behind these tools.)
Machine Learning for Programs

The general pipeline:

Analyze Program (PL) → Intermediate Representation → Train Model (ML) → Query Model (ML) → Applications

• Analyze Program (PL): alias analysis, typestate analysis, control-flow analysis, scope analysis
• Intermediate Representation: sequences (sentences), trees, translation table, feature vectors, graphical models (CRFs)
• Train Model (ML): N-gram language model, SVM / Structured SVM, neural networks
• Query Model (ML): argmax P(y | x) over y ∈ Ω, via greedy search or MAP inference
• Applications: code completion, deobfuscation, program synthesis, feedback generation, translation
Motivation: Working with APIs

    Camera camera = Camera.open();
    camera.setDisplayOrientation(90);
    camera.unlock();
    MediaRecorder rec = new MediaRecorder();
    rec.setCamera(camera);
    rec.setAudioSource(MediaRecorder.AudioSource.MIC);
    rec.setVideoSource(MediaRecorder.VideoSource.DEFAULT);
    rec.setOutputFormat(MediaRecorder.OutputFormat.MPEG_4);
    rec.setAudioEncoder(1);
    rec.setVideoEncoder(3);
    rec.setOutputFile("file.mp4");
    ...
    ?
    ?
    ?
Statistical Code Synthesis: Capabilities
[Code Completion with Statistical Language Models, V. Raychev, M. Vechev, E. Yahav, PLDI'14]

On examples such as the Camera/MediaRecorder snippet above, the system:
• handles multiple objects,
• infers multiple statements,
• infers sequences not in the training data.
Key Insight

Regularities in code are similar to regularities in natural language. We want to learn that MediaRecorder rec = new MediaRecorder(); comes before rec.setCamera(camera); just as, in natural language, "Hello" comes before "World!".
From Programs to Sentences

    she = new X();
    me = new Y();
    me.sleep();
    if (random()) { me.eat(); }
    she.enter();
    me.talk(she);

Using typestate analysis and alias analysis, we extract one sentence per abstract object and program path:

for abstract object me:   Yinit sleep talk
                          Yinit sleep eat talk
for abstract object she:  Xinit enter talkparam1
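To make the extraction concrete, here is a toy Python sketch of the idea. It uses regular expressions instead of a real alias/typestate analysis, so it produces a single sequence per syntactic variable rather than one sentence per program path; everything in it is illustrative, not the actual PLDI'14 implementation.

    import re
    from collections import defaultdict

    code = """
    she = new X(); me = new Y(); me.sleep();
    if (random()) { me.eat(); }
    she.enter(); me.talk(she);
    """

    # Toy extraction: one event sequence per syntactic object name.
    # A real pipeline runs alias + typestate analysis and emits one
    # sentence per path (e.g., with and without the if-branch).
    events = defaultdict(list)
    for stmt in re.findall(r"[^;{}]+;", code):
        m = re.match(r"\s*(\w+) = new (\w+)\(\)", stmt)
        if m:  # allocation: record an "Xinit"-style event
            events[m.group(1)].append(m.group(2) + "init")
            continue
        m = re.match(r"\s*(\w+)\.(\w+)\((\w*)\)", stmt)
        if m:  # call: event on the receiver, and on the argument if any
            events[m.group(1)].append(m.group(2))
            if m.group(3):
                events[m.group(3)].append(m.group(2) + "param1")

    for obj, sentence in events.items():
        print(obj, ":", " ".join(sentence))
    # me : Yinit sleep eat talk
    # she : Xinit enter talkparam1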
Training phase: programs → Program Analysis → sentences.
Learn Regularities

Learn regularities in the obtained sentences. Regularities in sentences ⇔ regularities in API usage. If we see many sequences like Xinit enter talkparam1, then we should learn that talkparam1 often comes after enter.
Statistical Language Models

Given a sentence s = w1 w2 w3 … wn, estimate P(w1 w2 w3 … wn).

Queries:
P("Hello World!") ≫ P("World! Hello")
argmax_y P("Hello" y)

Decomposed into conditional probabilities (chain rule):
P(w1 w2 w3 … wn) = ∏ i=1..n P(wi | w1 … wi-1)

P(The quick brown fox jumped) = P(The) P(quick | The) P(brown | The quick) P(fox | The quick brown) P(jumped | The quick brown fox)
N-gram Language Model

Condition only on the previous n-1 words:
P(wi | w1 … wi-1) ≈ P(wi | wi-n+1 … wi-1) ≈ #(wi-n+1 … wi-1 wi) / #(wi-n+1 … wi-1)

where #(n-gram) is the number of occurrences of the n-gram in the training data. E.g., with a 3-gram language model:
P(jumped | The quick brown fox) ≈ P(jumped | brown fox) ≈ #(brown fox jumped) / #(brown fox)

Training is achieved by counting n-grams. The time spent per word encountered in training is constant, so training is usually fast.
Tri-gram Language Model

3-gram               # of occurrences
brown fox jumped     125
brown fox walked     45
brown fox snapped    30

P(jumped | brown fox) ≈ 125/200 ≈ 0.625

P(w1 w2 w3 … wn) ≈ P(w1) P(w2 | w1) P(w3 | w1 w2) … P(wn | wn-2 wn-1)

P(brown fox jumped) ≈ P(brown) P(fox | brown) P(jumped | brown fox)
                    ≈ 200/600 · 200/200 · 125/200 ≈ 0.208
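As a minimal sketch (in Python, over a made-up three-sentence corpus), training really is just counting:

    from collections import Counter

    corpus = [
        "the quick brown fox jumped".split(),
        "the quick brown fox walked".split(),
        "the lazy brown fox jumped".split(),
    ]

    n = 3
    ngrams, contexts = Counter(), Counter()
    for sent in corpus:
        for i in range(len(sent) - n + 1):
            ngrams[tuple(sent[i:i + n])] += 1        # count each 3-gram ...
            contexts[tuple(sent[i:i + n - 1])] += 1  # ... and its 2-gram context

    def p(word, *context):
        """MLE estimate #(context word) / #(context)."""
        return ngrams[(*context, word)] / contexts[tuple(context)]

    print(p("jumped", "brown", "fox"))
    # 2/3: 'brown fox' occurs 3 times, 2 of them followed by 'jumped'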
Key Problem: Sparsity of Data

P(jumped | red fox) ≈ #(red fox jumped) / #(red fox)   ... what if this count is 0?

We need to handle n-grams with zero or few occurrences in the training data. The problem of sparsity gets worse as the size of the n-gram grows. Techniques that address it: smoothing, discounting.
Solution: Smoothing Techniques

• Smoothing techniques give non-zero probability to n-grams not in the training data.
• Essentially, they estimate how likely it is that an n-gram is missing due to the limited size of the training data.
• They work by taking probability mass from the existing n-grams and redistributing it over n-grams that occur zero times.
• Typically, the probability of such n-grams is estimated by looking at the probabilities of (n-1)-grams, (n-2)-grams, …, unigrams.
Smoothing: Intuitively

[Figure: distribution of probability mass before smoothing, where all mass sits on sentences in the training data and every other sentence gets probability 0, versus after smoothing, where all other sentences get non-zero probability.]
Smoothing Techniques

Backoff techniques:
Psmooth(wi | wi-n+1 … wi-1) =
    PML(wi | wi-n+1 … wi-1)                          if wi-n+1 … wi occurs in the training data
    γ(wi-n+1 … wi-1) · Psmooth(wi | wi-n+2 … wi-1)   otherwise

Interpolation techniques:
Psmooth(wi | wi-n+1 … wi-1) = λ PML(wi | wi-n+1 … wi-1) + (1 − λ) Psmooth(wi | wi-n+2 … wi-1)

Example smoothing techniques: Witten-Bell Interpolated (WBI), Witten-Bell Backoff (WBB), Natural Discounting (ND), Stupid Backoff (SB), …
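A minimal Python sketch of one of these, Stupid Backoff [Brants et al., 2007]: it assumes counts holds n-gram counts of every order (as built by the counting loop earlier) and total is the corpus word count. Strictly speaking it returns scores rather than normalized probabilities, which is why it is cheap.

    def stupid_backoff(word, context, counts, total, alpha=0.4):
        """Score s(word | context); backs off to shorter contexts."""
        score = 1.0
        while context:
            full = tuple(context) + (word,)
            if counts.get(full, 0) > 0:
                return score * counts[full] / counts[tuple(context)]
            score *= alpha         # fixed penalty for each backoff step
            context = context[1:]  # drop the oldest context word
        return score * counts.get((word,), 0) / total  # unigram fallback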
Summary: N-gram Language Model

Condition only on the previous n-1 words:
P(wi | w1 … wi-1) ≈ P(wi | wi-n+1 … wi-1) ≈ #(wi-n+1 … wi-1 wi) / #(wi-n+1 … wi-1)

+ Easy to implement
+ Fast to train (simply counting)
- Data sparsity issues
- Can be tricky to capture long-distance relationships

An existing library implementing language models is SRILM: http://www.speech.sri.com/projects/srilm/
Training phase: programs → Program Analysis → sentences → Train Language Model (N-gram) → language model.
Recurrent Neural Networks (RNNs)

An RNN language model reads one word at a time. The input word (e.g., "The", then "quick") is encoded as a one-hot vector (0 0 1 0 0 …); a hidden layer carries the state from step i-1 to step i; the output layer, a softmax classifier, gives the probability of each possible next word (probability of "The", probability of "brown", …).

Three weight matrices connect the layers: U (input to hidden), W (hidden to hidden, carrying the previous state), and V (hidden to output). U, V and W are learned through backpropagation through time, which unfolds the network over the sequence.

Matrix U learns a low-dimensional continuous vector representation of input words, also referred to as a "word embedding".

Word Embeddings
[Linguistic Regularities in Continuous Space Word Representations, Mikolov et al., NAACL-HLT 2013]
Summary: RNNs

+ Can learn dependencies beyond the prior several words (LSTM networks)
+ Active research area
+ Learns a continuous representation of the input vocabulary
- Slow to train
- Difficult to train
- Not interpretable

Download: RNNLM (http://rnnlm.org/), TensorFlow (https://www.tensorflow.org/)
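For concreteness, one forward step of such an RNN language model can be sketched in a few lines of NumPy. Sizes and weights are toy values (untrained), and training via backpropagation through time is not shown:

    import numpy as np

    VOCAB, HIDDEN = 5000, 128                  # toy sizes
    rng = np.random.default_rng(0)
    U = rng.normal(0, 0.01, (HIDDEN, VOCAB))   # input -> hidden (embeddings)
    W = rng.normal(0, 0.01, (HIDDEN, HIDDEN))  # hidden -> hidden (recurrence)
    V = rng.normal(0, 0.01, (VOCAB, HIDDEN))   # hidden -> output

    def step(word_id, state):
        """Consume one word, return P(next word) and the new state."""
        x = np.zeros(VOCAB)
        x[word_id] = 1.0                       # one-hot input word
        state = np.tanh(U @ x + W @ state)     # new hidden state
        logits = V @ state
        probs = np.exp(logits - logits.max())  # softmax classifier
        return probs / probs.sum(), state

    probs, state = step(42, np.zeros(HIDDEN))  # distribution over next word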
Training phase: programs → Program Analysis → sentences → Train Language Model (N-gram or RNN) → language model.
The SLANG System

Completion phase: the incomplete program (the Camera/MediaRecorder snippet with ? holes) is analyzed into sentences with holes; the trained language model completes the sentences; combining the completions yields the completed program:

    Camera camera = Camera.open();
    camera.setDisplayOrientation(90);
    camera.unlock();
    MediaRecorder rec = new MediaRecorder();
    rec.setCamera(camera);
    rec.setAudioSource(MediaRecorder.AudioSource.MIC);
    rec.setVideoSource(MediaRecorder.VideoSource.DEFAULT);
    rec.setOutputFormat(MediaRecorder.OutputFormat.MPEG_4);
    rec.setAudioEncoder(1);
    rec.setVideoEncoder(3);
    rec.setOutputFile("file.mp4");
    ...
Statistical Code Synthesis

    smsMgr = SmsManager.getDefault();
    int length = message.length();
    if (length > MAX_SMS_MESSAGE_LENGTH) {
      list = smsMgr.divideMessage(message);
      ? {smsMgr, list}      // (Hole H1)
    } else {
      ? {smsMgr, message}   // (Hole H2)
    }

Sentences per abstract object:
smsMgr:   getDefaultresult divideMessage H1
          getDefaultresult H2
list:     divideMessageresult H1
message:  length divideMessageparam1
          length H2
For each hole, query the model, e.g., argmax_H1 P(getDefaultresult divideMessage H1). Candidate completions and their probabilities:

smsMgr:   getDefaultresult divideMessage sendMultipartTextMessage   0.0033
          getDefaultresult divideMessage sendTextMessage            0.0016
          getDefaultresult sendTextMessage                          0.0073
          getDefaultresult sendMultipartTextMessage                 0.0010
list:     divideMessageresult sendMultipartTextMessageparam3        0.0821
message:  length length                                             0.0132
          length split                                              0.0080
          length sendTextMessageparam3                              0.0017

Not every combination is a feasible solution: the completions for a hole can disagree on the selected method, and the solution must satisfy the program constraints. The highest-scoring consistent completion:

    smsMgr = SmsManager.getDefault();
    int length = message.length();
    if (length > MAX_SMS_MESSAGE_LENGTH) {
      list = smsMgr.divideMessage(message);
      smsMgr.sendMultipartTextMessage(...list...);
    } else {
      smsMgr.sendTextMessage(...message...);
    }
The SLANG System

Training phase: 1M Java methods (~700MB) → Program Analysis → sentences (~100MB) → Train Language Model → language model.
Completion phase: incomplete program → Program Analysis → sentences with holes → Query model → completed sentences → Combine → completed program.

Evaluation: on 84 testing samples, the correct completion is in the top 3 for ~90% of the cases.
Lessons

• Connection to PL: decide what to "compile" into the words; the program analysis need not be sound.
• N-gram models: easy and cheap to train; important to deal with sparsity via smoothing; can be tricky to capture long-distance relationships.
• Recurrent networks: can capture longer-distance relationships; more expensive to train; experimentally similar to n-gram language models.
Importance of Static Analysis

[Chart: precision (0 to 1) vs. % of data used (1%, 10%, 100%), with and without alias analysis. Static analysis benefit ≈ 10x more data.]
Models of Full Language

So far: n-gram models and RNNs, modelling specific elements (APIs). How can we build statistical language models of the full programming language?
(Tutorial outline recap. Next: Graphical Models.)
Applications

Natural Language Processing: parsing, e.g., "the dog saw a man in the park" [parse tree: S, NP VP, Det N V NP PP, …]
Computer Vision
Programming:

    function chunkData(e, t) {
      var n = [];
      var r = e.length;
      var i = 0;
      for (; i < r; i += t)
        …
      return n;
    }
Challenges

• Facts to be predicted are dependent
• Millions of candidate choices
• Must quickly learn from huge codebases
• Prediction should be fast (real time)
Key Idea

Phrase the problem of predicting program facts as structured prediction.
Structured Prediction for Programs [V. Raychev, M. Vechev, A. Krause, ACM POPL'15]

Program Analysis → Graphical Models (knowledge) → Inference & Learning (reasoning). This was the first connection between programs and Conditional Random Fields.
In the general pipeline, the intermediate representations now include graphical models (CRFs), and the trained models include log-linear models / Markov networks. More information and tutorials at http://www.nice2predict.org/ and http://www.srl.inf.ethz.ch/.
Defining a Global Probabilistic Model

Joint probability distribution P(A, B, C, D), as a table of joint probabilities:

A B C D
0 0 0 0   P(a0, b0, c0, d0) = 0.04
0 0 0 1   P(a0, b0, c0, d1) = 0.04
0 0 1 0   P(a0, b0, c1, d0) = 0.04
0 0 1 1   P(a0, b0, c1, d1) = 4.1 · 10^-6
… … … …  …

n variables → O(2^n) table entries! Instead, define the global model by combining a set of local interactions between variables:

P(A, B, C, D) ∝ F(A, B, C, D)
F(A, B, C, D) = φ1(A, B) · φ2(B, C) · φ3(C, D) · φ4(D, A)
Factors

• A factor (or potential) is a function φ from a set of random variables D to a real number: φ: D → ℝ
• The set of variables D is the scope of the factor; we are typically concerned with non-negative factors.
• Intuition: a factor captures affinity, agreement, or compatibility of the variables in D.
Factors: Example

φ1(A,B): (0,0)→30   (0,1)→5    (1,0)→1    (1,1)→10
φ2(B,C): (0,0)→100  (0,1)→1    (1,0)→1    (1,1)→100
φ3(C,D): (0,0)→1    (0,1)→100  (1,0)→100  (1,1)→1
φ4(D,A): (0,0)→100  (0,1)→1    (1,0)→1    (1,1)→100

• φ1(0,0) = 30 says we believe A and B agree on 0 with belief 30
• φ3(C,D) says that C and D "argue all the time" (they prefer to disagree)

Key point: (local) factors have no probabilistic interpretation.
Defining a Global Probabilistic Model

P(A, B, C, D) = F(A, B, C, D) / Z
F(A, B, C, D) = φ1(A, B) · φ2(B, C) · φ3(C, D) · φ4(D, A)
Z = Σ over a ∈ A, b ∈ B, c ∈ C, d ∈ D of F(a, b, c, d)    (the partition function)
P(A, B, C, D)

A B C D   Unnormalized (F)   Normalized (P = F/Z)
0 0 0 0   300,000            0.04
0 0 0 1   300,000            0.04
0 0 1 0   300,000            0.04
0 0 1 1   30                 4.1 · 10^-6
… … … …   …                  …

Z = 7,201,840

We can now answer probability queries on P(A, B, C, D). For example:
P(A = 0, B = 0) = Σ over c ∈ C, d ∈ D of P(0, 0, c, d) ≈ 0.125
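Because the example is tiny, the whole computation can be checked by brute force in Python; enumerating all 16 assignments reproduces the Z and the query above:

    from itertools import product

    phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}     # phi1(A,B)
    phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}   # phi2(B,C)
    phi3 = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}   # phi3(C,D)
    phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}   # phi4(D,A)

    def F(a, b, c, d):  # unnormalized measure
        return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]

    Z = sum(F(*v) for v in product([0, 1], repeat=4))
    print(Z)                  # 7201840
    print(F(0, 0, 0, 0) / Z)  # ~0.04
    print(sum(F(0, 0, c, d) for c in (0, 1) for d in (0, 1)) / Z)  # ~0.125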
Markov Network

Let X = {X1, …, Xn} be a set of random variables and D1, …, Dm ⊆ X. Then a Markov Network is defined as:

P(X1, X2, …, Xn) = φ1(D1) · φ2(D2) ⋯ φm(Dm) / Z
Z = Σ over all assignments to X1, …, Xn of φ1(D1) · φ2(D2) ⋯ φm(Dm)
Graphs vs. Probability Distributions

We next relate graphs and probability distributions. This is important for two reasons. First, it allows us to determine properties (e.g., independence of variables) of a probability distribution directly from the graph. Second, it allows us to answer queries (e.g., MAP inference) on the probability distribution by working with the graph.

P(A, B, C, D) = φ1(A, B) · φ2(B, D) · φ3(D, C) · φ4(C, A) / Z

Which variables are independent? Intuitively, variables are nodes and factors introduce edges: here the graph is the 4-cycle with edges A-B, B-D, D-C, C-A.
Graph Factorization

A distribution P equipped with a set of factors φ1(D1), …, φm(Dm) factorizes over a graph G if each Di is a complete subgraph of G.

• P(A, B, C, D) = φ1(A, B) φ2(B, D) φ3(D, C) φ4(C, A) / Z factorizes over the 4-cycle graph: every pairwise scope is an edge.
• P(A, B, C, D) = φ1(A, B, C) φ2(B, D, C) / Z does not factorize over the 4-cycle: {A, B, C} is not a complete subgraph.
• Over the complete graph on A, B, C, D, both P = φ1(A, B) φ2(A, C) φ3(A, D) φ4(B, C) φ5(B, D) φ6(C, D) / Z and P = φ1(A, B, C, D) / Z factorize.
Graphs vs. Probabilities

There is an important connection between a probability distribution that factorizes over a graph and the properties of the graph. In particular, we can discover various independence properties of the distribution directly from the graph.
Graph Separation

Two disjoint sets of nodes A and B in a graph G are separated by a set S if every path between A and B goes through a node in S. (A and B are also separated if no path between them exists, regardless of S.)

The Global Markov Property says that if A and B are separated by S, then A ⊥ B | S, that is, P(A, B | S) = P(A | S) · P(B | S).
Graph Separation: Example

[Graph over nodes A, B, C, D, E, F, G]

Examples of global independencies found in the graph:
{A} ⊥ {G} | {D}, i.e., P(A, G | D) = P(A | D) · P(G | D)
{B} ⊥ {F} | {E, G}
Graphical Models: So far

We have seen what a Markov Network is, what it means for a probability distribution to factorize over a graph, and how to extract independence information from the graph. So far we have been discussing the joint distribution P(X1, …, Xn); we next focus on conditional distributions.
So far: generative models, which model the full joint distribution.

Next: discriminative models, which predict unknown facts given some known facts and do not model a distribution over the known facts.
Conditional Random Field [J. Lafferty, A. McCallum, F. Pereira, ICML 2001]

Let X ∪ Y be a set of random variables, where X = {X1, …, Xn} are the observed variables and Y = {Y1, …, Yn} are the target variables. Let D1, D2, …, Dm ⊆ X ∪ Y, where each Di ⊈ X (every factor touches at least one target variable). A Conditional Random Field is defined as:

P(Y1, …, Yn | X1, …, Xn) = φ1(D1) · φ2(D2) ⋯ φm(Dm) / Z(X1, …, Xn)
Z(X1, …, Xn) = Σ over Y of φ1(D1) · φ2(D2) ⋯ φm(Dm)

Note that Z is a function of the variables in X only (the sum ranges over Y).

Key advantage: a CRF avoids encoding a distribution over the variables X. Thus, we can use many observed variables with complex dependencies without needing to model any joint distribution over them.
Conditional Random Field: Example — Text Analytics (Named Entity Recognition)

Problem statement: given a sentence, label each word with whether it is a person or a location.
5 labels: B-per, I-per, B-loc, I-loc, O

Given:    Mr.    Smith  came  today  to  Toronto
Predict:  B-per  I-per  O     O      O   B-loc

The words are observed variables X1 … X6; the labels are target variables Y1 … Y6.

Two kinds of factors for each word Xt:
φ1_t(Yt, Yt+1)          captures the relationship between neighboring predictions
φ2_t(Yt, X1, …, Xt)     the relationship between a prediction and the word and its neighbors

P(Y1, …, Y6 | X1, …, X6) = φ1_1(Y1, Y2) ⋯ φ1_5(Y5, Y6) · φ2_1(Y1, X1, X2) ⋯ φ2_6(Y6, X5, X6) / Z(X1, …, X6)

where Z(X1, …, X6) sums the same product over all label assignments Y1, …, Y6.
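For linear chains like this one, both Z(X) and the most likely labeling can be computed exactly by dynamic programming. Below is a NumPy sketch with random stand-in factor scores; a real NER model would compute the log φ2 scores from word features and learn all weights:

    import numpy as np

    LABELS = ["B-per", "I-per", "B-loc", "I-loc", "O"]
    WORDS = "Mr. Smith came today to Toronto".split()
    L, T = len(LABELS), len(WORDS)

    rng = np.random.default_rng(0)
    log_phi1 = rng.normal(size=(L, L))  # log phi1(y_t, y_{t+1}), shared over t
    log_phi2 = rng.normal(size=(T, L))  # log phi2_t(y_t, x), toy scores

    def logsumexp(a, axis=0):
        m = a.max(axis=axis)
        return m + np.log(np.exp(a - np.expand_dims(m, axis)).sum(axis=axis))

    # Forward algorithm: log Z(x) in O(T * L^2) instead of O(L^T)
    alpha = log_phi2[0]
    for t in range(1, T):
        alpha = log_phi2[t] + logsumexp(alpha[:, None] + log_phi1, axis=0)
    log_Z = logsumexp(alpha)

    # Viterbi: the most likely label sequence under the same factors
    delta, back = log_phi2[0], []
    for t in range(1, T):
        scores = delta[:, None] + log_phi1
        back.append(scores.argmax(axis=0))
        delta = log_phi2[t] + scores.max(axis=0)
    y = [int(delta.argmax())]
    for b in reversed(back):
        y.append(int(b[y[-1]]))
    print(list(zip(WORDS, (LABELS[i] for i in reversed(y)))))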
Learning from Data

We next look at an equivalent representation of Markov Networks and Conditional Random Fields that is amenable to learning from data: the log-linear form.
Log-Linear Models

Another representation of a Markov Network is a log-linear model. Here we have a set of feature functions f1(D1), …, fk(Dk), where each Di is a complete subgraph of G. A feature fi(Di) can return any value, i.e., it can be negative.

P(X1, …, Xn) = exp(−w1 f1(D1)) ⋯ exp(−wk fk(Dk)) / Z(X1, …, Xn)
Z(X1, …, Xn) = Σ over all assignments of exp(−Σ i=1..k wi fi(Di))

Any Markov Network whose factors are positive can be converted to a log-linear model.

Log-Linear Representation: Benefits

Log-linear models have a few benefits over factors. They:
1. Make certain relationships more explicit.
2. Offer a much more compact representation for many distributions, especially for variables with large domains (e.g., names in JSNice).
3. Are useful for learning: we are given the feature functions, and the learning phase simply figures out the weights.
Feature Function: Example

Suppose the factor φ1 is:
φ1(A, B) = 100 if A = B, and 1 otherwise.

If A and B range over m and n values respectively, we must store m · n values to keep this factor. Instead, we can use a single (indicator) feature function:
f1(A, B) = 1 if A = B, and 0 otherwise, with weight w1 = −4.61 (since exp(−w1) ≈ 100).

Indicator functions can be directly extracted from data during learning (later).
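A quick sanity check of the numbers in Python (the weight −4.61 is just −ln 100, so the log-linear form reproduces the original factor up to rounding):

    import math

    w1 = -4.61                            # weight from the slide, ~ -ln(100)
    f1 = lambda a, b: 1 if a == b else 0  # indicator feature

    phi = lambda a, b: math.exp(-w1 * f1(a, b))
    print(phi(0, 0), phi(0, 1))           # ~100.5 and 1.0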
Graphical Models: So far

In this segment we learned about log-linear models and indicator functions. Log-linear models are important because they capture factors more concisely, the indicator functions can be extracted directly from data, and the weights can be learned to fit the optimization objective (discussed later).
Hands-on
Structured Prediction for Programs

    function chunkData(e, t) {
      var n = [];
      var r = e.length;
      var i = 0;
      for (; i < r; i += t)
        if (i + t < r) n.push(e.substring(i, i + t));
        else n.push(e.substring(i, r));
      return n;
    }

Unknown facts: the names of i, t, r, … Known facts: e.g., the property name length. The facts form a network; pairwise feature functions over connected facts carry learned weights w:

edge (i, t):       i, step → 0.5        j, step → 0.4
edge (i, r):       i, len → 0.6         i, length → 0.3
edge (r, length):  length, length → 0.5   len, length → 0.3

Prediction picks the joint assignment argmax w^T f(i, t, r, length), here i, step, len, yielding:

    function chunkData(str, step) {
      var colNames = [];
      var len = str.length;
      var i = 0;
      for (; i < len; i += step)
        if (i + step < len) colNames.push(str.substring(i, i + step));
        else colNames.push(str.substring(i, len));
      return colNames;
    }
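On an example this small, the MAP query can be solved by brute force. The Python sketch below scores every joint assignment using the pairwise weights from the table (pairs not listed score 0; the candidate sets are made up for illustration). The real system cannot enumerate like this, since its graphs have millions of candidate names, which is why it uses greedy/approximate MAP inference:

    from itertools import product

    # pairwise feature weights w from the table above; unseen pairs score 0
    w = {
        ("i", "t"): {("i", "step"): 0.5, ("j", "step"): 0.4},
        ("i", "r"): {("i", "len"): 0.6, ("i", "length"): 0.3},
        ("r", "length"): {("length", "length"): 0.5, ("len", "length"): 0.3},
    }
    candidates = {"i": ["i", "j"], "t": ["step"], "r": ["len", "length"]}
    known = {"length": "length"}            # observed fact

    def score(assignment):                  # w^T f for one joint assignment
        a = {**assignment, **known}
        return sum(tbl.get((a[u], a[v]), 0.0) for (u, v), tbl in w.items())

    nodes = list(candidates)
    best = max((dict(zip(nodes, vals))
                for vals in product(*candidates.values())), key=score)
    print(best, score(best))  # {'i': 'i', 't': 'step', 'r': 'len'} 1.4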
(Tutorial outline recap. Next: Inference & Learning in Markov Networks.)
Queries: MAP vs. Max Marginals

MAP inference:   (y1, …, yn)^MAP = argmax over y1, …, yn of P(y1, …, yn)
Max marginals:   (y1, …, yn)^ML = (y1^ML, …, yn^ML), where each yi^ML = argmax over yi of P(yi)

Consider the following probability distribution:

A B   val
0 0   0.25
0 1   0.30
1 0   0.10
1 1   0.35

argmax P(A, B) over (A, B) gives (1, 1) (MAP), but maximizing the marginals P(A) and P(B) independently gives (0, 1) (ML).

Empirical comparison:
Baseline         25.3%
Max Marginals    54.1%
MAP Inference    63.4%
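The two queries are easy to compare on the toy table with a short NumPy check:

    import numpy as np

    # joint distribution P(A, B) from the table above
    P = np.array([[0.25, 0.30],
                  [0.10, 0.35]])

    print(np.unravel_index(P.argmax(), P.shape))           # MAP: (1, 1)
    print(P.sum(axis=1).argmax(), P.sum(axis=0).argmax())  # max marginals: 0 1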
MAP Inference

Bad news: computing the argmax is in general NP-hard (Max-SAT).
Good news: many approximate methods exist:
• Gibbs sampling
• Expectation Maximization
• Variational methods
• Elimination algorithm
• Junction-tree algorithm
• Loopy belief propagation

Bad news: these are still too slow for learning, at O(N · M^N), where N is the number of neighbors and M the number of states. We designed an approximate algorithm to fit our needs, i.e., to deal with many values.
Structured SVM Training [N. Ratliff, J. Bagnell, M. Zinkevich, AISTATS 2007]

Given a data set D = {x(j), y(j)} for j = 1..n, learn the weights w of
P(y | x) = (1/Z(x)) exp(w^T f(y, x))

Optimization objective (max-margin training): for all samples j and all y,
Σi wi fi(x(j), y(j)) ≥ Σi wi fi(x(j), y) + Δ(y, y(j))

i.e., the given prediction must be better than any other prediction by a margin. This avoids the expensive computation of the partition function Z(x).
Learning Phase: programs → program analysis (alias, call analysis) and transformation → SSVM learning (max-margin training). 7M feature functions for names, 70K feature functions for types; model size 150MB.

Prediction Phase: program → program analysis → Conditional Random Field (typically ~30 nodes, 400 edges) → MAP inference on P(y | x) in milliseconds; e.g., renaming var n = []; var r = e.length; … to var colNames = []; var len = str.length; …

Accuracy: names 63%, types 81% (helps typechecking).

Structured Prediction for Programs (V. Raychev, M. Vechev, A. Krause, ACM POPL'15)
(Tutorial outline recap. Next: Learning Features for Programs; Combining Program Synthesis + Machine Learning.)
Models of Full Language

So far: n-gram models and RNNs, modelling specific elements (APIs). How can we build statistical language models of the full programming language?
Program Representation: Sequences

    elem.notify({
      position: 'top',
      autoHide: false,
      delay: 100
    });

As a token sequence:
elem, ., notify, (, {, position, :, 'top', ',', autoHide, :, false, ',', delay, :, 100, }, ), ;

A model learned over such sequences is then used to predict held-out tokens:
?, ?, ?, (, {, ?, :, ?, ',', ?, :, ?, ',', ?, :, ?, }, ?, ;
Statistical Code Synthesis: Existing Approaches

Two families of existing approaches, both of the form x ← argmax P(x | conditioning context):
• Syntactic: condition on the surrounding program text; a bad fit for programs.
• Semantic: condition on hard-coded features (e.g., promise, defer, reject); task- and language-specific heuristics.
[Hindle et al., 2012] [Nguyen et al., 2013] [Allamanis et al., 2014] [Raychev et al., 2014] [Allamanis et al., 2015]
Program Representation: Trees

    elem.notify({
      position: 'top',
      autoHide: false,
      delay: 100
    });

corresponds to the abstract syntax tree:

CallExpression
  MemberExpression
    Identifier elem
    Property notify
  ObjectExpression
    Property position (String 'top')
    Property autoHide (Boolean false)
    Property delay (100)
Probabilistic Context-Free Grammars

• N is a finite set of non-terminal symbols
• Σ is a finite set of terminal symbols
• R is a finite set of production rules α → β1…βn, with α ∈ N and βi ∈ (N ∪ Σ)
• S ∈ N is a start symbol
• q: R → [0, 1] is the conditional probability of choosing a given production rule
For the AST above, candidate production rules include:

CallExpr → MemberExpr              [ elem.notify() ]
CallExpr → MemberExpr Identifier   [ elem.notify(x) ]
CallExpr → Identifier              [ notify() ]
CallExpr → MemberExpr ObjectExpr   [ elem.notify({…}) ]
Training estimates rule probabilities by counting:
q(α → β1…βn) ≈ #(α → β1…βn) / #(α)

This yields a valid probability distribution: for every α, the sum of q(α → β1…βn) over all rules α → β1…βn ∈ R equals 1.
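The training step is a direct generalization of n-gram counting. A small Python sketch, over a handful of made-up rule observations:

    from collections import Counter

    # (head nonterminal, children) pairs observed in parsed training data
    observed = [
        ("CallExpr", ("MemberExpr",)),
        ("CallExpr", ("MemberExpr", "ObjectExpr")),
        ("CallExpr", ("MemberExpr", "ObjectExpr")),
        ("CallExpr", ("MemberExpr", "Identifier")),
    ]

    rule_counts = Counter(observed)
    head_counts = Counter(head for head, _ in observed)

    def q(head, children):
        """q(head -> children) = #(rule) / #(head)."""
        return rule_counts[(head, tuple(children))] / head_counts[head]

    print(q("CallExpr", ("MemberExpr", "ObjectExpr")))  # 0.5
    # for each head, the rule probabilities sum to 1 by construction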
Probabilistic Context-Free Grammars: Limitations

• Poor context for prediction (e.g., only the parent non-terminal)
• Non-trivial to obtain good production rules
[Liang et al., ICML'10] [Allamanis & Sutton, FSE'14] [Maddison & Tarlow, ICML'14] [Gvero & Kuncak, OOPSLA'15]
Lessons

• Probabilistic grammars:
  + Easy to implement and cheap to train
  + Defined over a generic AST representation of programs
  - Important to deal with sparsity via smoothing
  - Non-trivial to define good production rules
  - Production rules selected based on limited context (e.g., the parent non-terminal symbol)

PCFGs are still poor at modelling a full programming language, but they are a step towards the solution.
Our Goal

A probabilistic model, learned from existing programs, that offers high precision, efficient learning, wide applicability, and explainable predictions.
Key Idea

P(wi | context)

So far, the context has been defined manually for each model (n-grams, RNNs, CRFs, PCFGs). Goal: use program synthesis and machine learning techniques to learn the best context for a given query wi.
PHOG: Concepts

Use a function to build a probabilistic model that generalizes PCFGs by allowing conditioning on richer context. Program synthesis learns a function that explains the data; the function returns a conditioning context for a given query.
PHOG: Generalizes PCFG

Context-Free Grammar: α → β1…βn
  Property → x          0.05
  Property → y          0.03
  Property → promise    0.001

Higher Order Grammar: α[γ] → β1…βn
  Property[reject, promise] → promise   0.67
  Property[reject, promise] → notify    0.12
  Property[reject, promise] → resolve   0.11
Conditioning on Richer Context: α[γ] → β1…βn

What is the best conditioning context γ? Candidates include identifiers, constants, APIs, fields, control structures, … We need a function mapping the source code to a conditioning context γ.
Higher Order Grammar

Parametrize the grammar by a function used to dynamically obtain the context:

Production rules R: α[γ] → β1…βn
Function: f: AST → γ

Applying f to the abstract syntax tree of the source code yields the conditioning context γ.
Function Representation

In general, f could be an unrestricted (Turing-complete) program. In our work, f is expressed in TCond, a language for navigating over trees and accumulating context:

TCond   ::= ε | WriteOp TCond | MoveOp TCond
MoveOp  ::= Up | Left | Right | DownFirst | DownLast | NextDFS | PrevDFS | NextLeaf | PrevLeaf | PrevNodeType | PrevNodeValue | PrevNodeContext
WriteOp ::= WriteValue | WriteType | WritePos

Move operations (Up, Left, …) walk over the AST; write operations append to the accumulated context: γ ← γ · value.
Expressing Functions: TCond Example

    elem.notify(
      ... ,
      ... ,
      {
        position: 'top',
        hide: false,
        ?          ← query
      }
    );

Executing the TCond program Left WriteValue Up WritePos Up DownFirst DownLast WriteValue from the query node:

Op           γ
Left         {}
WriteValue   {hide}
Up           {hide}
WritePos     {hide, 3}
Up           {hide, 3}
DownFirst    {hide, 3}
DownLast     {hide, 3}
WriteValue   {hide, 3, notify}

The accumulated context is { previous property, parameter position, API name }.
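The walk above is mechanical enough to script. Below is a toy Python interpreter for the four TCond operations used in this example, over a hand-built miniature AST. The node kinds and the exact WritePos convention are simplifying assumptions: in this encoding the object literal is child 3 of the call node, which is where the "3" comes from.

    class Node:
        def __init__(self, kind, value=None, children=()):
            self.kind, self.value, self.parent = kind, value, None
            self.children = list(children)
            for c in self.children:
                c.parent = self

    def run_tcond(ops, start):
        cur, ctx = start, []
        for op in ops:
            sibs = cur.parent.children if cur.parent else []
            if op == "Up":           cur = cur.parent
            elif op == "Left":       cur = sibs[sibs.index(cur) - 1]
            elif op == "DownFirst":  cur = cur.children[0]
            elif op == "DownLast":   cur = cur.children[-1]
            elif op == "WriteValue": ctx.append(cur.value)
            elif op == "WritePos":   ctx.append(sibs.index(cur))
        return ctx

    # elem.notify(..., ..., { position: 'top', hide: false, ? })
    hole = Node("Property", "?")
    obj = Node("ObjectExpression", children=[
        Node("Property", "position"), Node("Property", "hide"), hole])
    callee = Node("MemberExpression", children=[
        Node("Identifier", "elem"), Node("Property", "notify")])
    call = Node("CallExpression", children=[
        callee, Node("Arg", "..."), Node("Arg", "..."), obj])

    print(run_tcond(["Left", "WriteValue", "Up", "WritePos",
                     "Up", "DownFirst", "DownLast", "WriteValue"], hole))
    # ['hide', 3, 'notify']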
Learning PHOG

Over the existing dataset D and the TCond language, synthesize
f_best = argmin over f ∈ TCond of cost(D, f)
Synthesis: Practically Intractable

With millions (≈ 10^8) of input/output examples, computing cost(D, f) is O(|D|) per candidate f, making the synthesis of f_best practically intractable. Key idea: iterative synthesis on a fraction of the examples.
Solution: Two Components

A program generator and a representative dataset sampler: the sampler picks a dataset d ⊆ D, and the generator solves f_best = argmin over f ∈ TCond of cost(d, f).

In a loop: start with a small sample d1 ⊆ D; the generator produces a candidate program p1; the sampler picks a new dataset d2 ⊆ D; the generator produces p2; iterate until p_best.
Representative Dataset Sampler

Idea: pick a small dataset d on which the already generated programs p1…pn behave as they do on the full dataset D:

d = argmin over d ⊆ D of max over i ∈ 1…n of |cost(d, pi) − cost(D, pi)|

Theorem: the sampler shrinks the search space of candidate programs.
Results

A probabilistic model of the JavaScript language: 100k training set, 20k learning set, 50k blind set.
Results: JavaScript APIs

Conditioning program p                        Accuracy
Last two tokens, Hindle et al. [ICSE'12]      22.2%
Last two APIs, Raychev et al. [PLDI'14]       30.4%
Program synthesis                             46.3%
Program synthesis + dataset sampler           50.4%
Results: JavaScript Structure

Model                Accuracy
PCFG                 51.1%
N-gram               71.3%
Naïve Bayes          44.2%
SVM                  70.5%
Program synthesis    81.5%
Results: JavaScript Values

Node type      Accuracy   Example
Identifier     62%        contains = jQuery …
Property       65%        start = list.length;
String         52%        '[' + attrs + ']'
Number         64%        canvas(xy[0], xy[1], …)
RegExp         66%        line.replace(/( | )+/, …)
UnaryExpr      97%        if (!events || !…)
BinaryExpr     74%        while (++index < …)
LogicalExpr    92%        frame = frame || …
(Tutorial outline recap: all sections covered.)
Machine Learning for Programs: recap of the full pipeline, Analyze Program (PL) → Intermediate Representation → Train Model (ML) → Query Model (ML) → Applications. More information and tutorials at http://www.nice2predict.org/ and http://www.srl.inf.ethz.ch/.
Statistical Programming Tools: code completion [PLDI'14], programming language translation [ONWARD'14], statistical bug detection, JavaScript deobfuscation and type prediction [POPL'15] (www.jsnice.org). All of these benefit from the probabilistic model for code.
Materials

Dagstuhl Seminar on Big Code Analytics, Nov 2015: http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=15472 (ML, NLP, PL, SE; links to materials and people in the general area)
http://learningfrombigcode.org (data sets, tools, challenges)
http://nice2predict.org (fully open source, online demo, build your own tool)