programming with “big code” - github pages · predicting program properties from “big...

Programming with “Big Code”: Lessons, Techniques, Applications

Pavol Bielik, Veselin Raychev, Martin VechevDepartment of Computer ScienceETH Zurich

Work @ ETH Zurich

Work on “Big Code” started a few years ago

Code Completion with Statistical Language Models, PLDI 2014Machine Translation for Programming Languages, Onward 2014Predicting Program Properties from “Big Code”, POPL 2015Fast and Precise Statistical Code Completion, ETH TRStatistical Feedback Generation for Programs, ETH TRProgramming with Big Code: Lessons, Techniques and Applications, SNAPL 2015

Prof.Martin Vechev

Prof.AndreasKrause

VeselinRaychev

PavolBielik

Svetoslav Karaivanov

ChristineZeller

PascalRoos

Applications[PLDI 14]SLANG: Code Completion

Intent i = new Intent();

?ctx.sendBroadcast(i);

All of these benefit from the “Big Code” and lead to applications not possible with previous techniques




P( Java | C# )P( C# | Java )P( Java )

[Onward 14]Programming Language Translation


...for x in range(a):

print a[x]

[submitted]Statistical Feedback Generation




likely error




[POPL 15]JSNice: DeobfuscationType Prediction

...for x in range(a):

print a[x]

[submitted]Statistical Feedback Generation




likely error




Probabilistic Programming Systems: Dimensions

Applications

Intermediate Representation

Analyze Program(PL)

Train Model(ML)

Query Model(ML)

What is a generic metric for code?Applications


Analyze Program(PL)

Train Model(ML)

Query Model(ML)

✔ Cross Entropy → ✗ Code Completion✔ BLEU Score → ✗ Program Translation


Traditional metrics might not be indicative of client performance

What is the best program representation?Applications


Analyze Program(PL)

Train Model(ML)

Query Model(ML)




Analyze Program(PL)

Train Model(ML)

Query Model(ML)


Sequences

req → {<open, 0>, <send, 0>}source → {..., <open, 2>}

=

a +

x y

Trees

Graphical Models Feature Vectors

req → (0,0,1,1,0)source → (1,0,0,0,0)...



Analyze Program(PL)

Train Model(ML)

Query Model(ML)


Choosing the right representation is crucial

Feedback Generation: Sequence representations

Allamanis et. al. [2013] 46.4%

Hsiao et. al. [2014] 50.8%

Incorporate semantic information 75.3%

Incorporate dataflow analysis 86.3%

Applications


Analyze Program(PL)

Train Model(ML)

Query Model(ML)

How to extract program representation?SLANG (APIs): alias and typestate analysisJSNice (Variable Names): scope and alias analysisFeedback Generation: alias, control-flow and typestate analysis


req.open("GET", source, false);

req → {<open, 0>, <send, 0>}source → {..., <open, 2>}

Applications


Analyze Program(PL)

Train Model(ML)

Query Model(ML)

How to extract program representation?SLANG (APIs): alias and typestate analysisJSNice (Variable Names): scope and alias analysisFeedback Generation: alias, control-flow and typestate analysis

Design scalable yet precise enough algorithms


1

0.5

0

no alias analysiswith alias analysis

1% 10% 100%

[Precision vs % of data used]

Applications


Analyze Program(PL)

Train Model(ML)

Query Model(ML)

What is the suitable probabilistic model?N-gram language model

Probabilistic context-free grammarsNeural networks

(Structured) Support vector machineConditional Random Fields

...


Applications


Analyze Program(PL)

Train Model(ML)

Query Model(ML)

What is the suitable probabilistic model?N-gram language model

Probabilistic context-free grammarsNeural networks

(Structured) Support vector machineConditional Random Fields

...


Baseline 25.3%

Independent 54.1%

Structured 63.4%

Structured prediction is critical

Programming with “Big Code”Applications


Analyze Program(PL)

Train Model(ML)

Query Model

Code completionDeobfuscation

Program synthesis

Feedback generation

Translation

alias analysis

typestate analysis

Graphical Models

N-gram language modelSVM Structured SVM

Neural Networks

Sequences (sentences)

Trees

Translation TableFeature Vectors

control-flow analysis

scope analysis

argmax P(y | x)

y ∈ Ω



Analyze Program(PL)

Train Model(ML)

Query Model


Program synthesis

Feedback generation

Translation

alias analysis

typestate analysis

Graphical Models


Neural Networks


Trees



scope analysis

argmax P(y | x)

y ∈ Ω

Greedy MAP Inference

http://www.nice2predict.org/http://www.srl.inf.ethz.ch/spas.php More information and tutorials at:

http://www.nice2predict.org/


http://www.srl.inf.ethz.ch/spas.php


General framework


We have open-sourced our prediction engine and we are extending it with new capabilities

Upcoming PLDI’15 tutorial





Analyze Program(PL)

Train Model(ML)

Query Model


Program synthesis

Feedback generation

Translation

alias analysis

typestate analysis

Graphical Models


Neural Networks


Trees



scope analysis

argmax P(y | x)

y ∈ Ω

Greedy MAP Inference

http://www.nice2predict.org/http://www.srl.inf.ethz.ch/spas.php More information and tutorials at:





programming with “big code” - github pages · predicting program properties from “big...

Documents