programming with “big code” - github pages · predicting program properties from “big...

19
Programming with “Big Code”: Lessons, Techniques, Applications Pavol Bielik, Veselin Raychev, Martin Vechev Department of Computer Science ETH Zurich

Upload: others

Post on 23-May-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Programming with “Big Code” - GitHub Pages · Predicting Program Properties from “Big Code”, POPL 2015 Fast and Precise Statistical Code Completion, ETH TR Statistical Feedback

Programming with “Big Code”: Lessons, Techniques, Applications

Pavol Bielik, Veselin Raychev, Martin VechevDepartment of Computer ScienceETH Zurich

Page 2: Programming with “Big Code” - GitHub Pages · Predicting Program Properties from “Big Code”, POPL 2015 Fast and Precise Statistical Code Completion, ETH TR Statistical Feedback

Work @ ETH Zurich

Work on “Big Code” started a few years ago

Code Completion with Statistical Language Models, PLDI 2014Machine Translation for Programming Languages, Onward 2014Predicting Program Properties from “Big Code”, POPL 2015Fast and Precise Statistical Code Completion, ETH TRStatistical Feedback Generation for Programs, ETH TRProgramming with Big Code: Lessons, Techniques and Applications, SNAPL 2015

Prof.Martin Vechev

Prof.AndreasKrause

VeselinRaychev

PavolBielik

Svetoslav Karaivanov

ChristineZeller

PascalRoos

Page 3: Programming with “Big Code” - GitHub Pages · Predicting Program Properties from “Big Code”, POPL 2015 Fast and Precise Statistical Code Completion, ETH TR Statistical Feedback

Applications[PLDI 14]SLANG: Code Completion

Intent i = new Intent();

?ctx.sendBroadcast(i);

All of these benefit from the “Big Code” and lead to applications not possible with previous techniques

Page 4: Programming with “Big Code” - GitHub Pages · Predicting Program Properties from “Big Code”, POPL 2015 Fast and Precise Statistical Code Completion, ETH TR Statistical Feedback

Applications[PLDI 14]SLANG: Code Completion

Intent i = new Intent();

?ctx.sendBroadcast(i);

P( Java | C# )P( C# | Java )P( Java )

[Onward 14]Programming Language Translation

All of these benefit from the “Big Code” and lead to applications not possible with previous techniques

Page 5: Programming with “Big Code” - GitHub Pages · Predicting Program Properties from “Big Code”, POPL 2015 Fast and Precise Statistical Code Completion, ETH TR Statistical Feedback

...for x in range(a):

print a[x]

[submitted]Statistical Feedback Generation

Applications[PLDI 14]SLANG: Code Completion

Intent i = new Intent();

?ctx.sendBroadcast(i);

likely error

P( Java | C# )P( C# | Java )P( Java )

[Onward 14]Programming Language Translation

All of these benefit from the “Big Code” and lead to applications not possible with previous techniques

Page 6: Programming with “Big Code” - GitHub Pages · Predicting Program Properties from “Big Code”, POPL 2015 Fast and Precise Statistical Code Completion, ETH TR Statistical Feedback

[POPL 15]JSNice: DeobfuscationType Prediction

...for x in range(a):

print a[x]

[submitted]Statistical Feedback Generation

Applications[PLDI 14]SLANG: Code Completion

Intent i = new Intent();

?ctx.sendBroadcast(i);

likely error

P( Java | C# )P( C# | Java )P( Java )

[Onward 14]Programming Language Translation

All of these benefit from the “Big Code” and lead to applications not possible with previous techniques

Page 7: Programming with “Big Code” - GitHub Pages · Predicting Program Properties from “Big Code”, POPL 2015 Fast and Precise Statistical Code Completion, ETH TR Statistical Feedback

Probabilistic Programming Systems: Dimensions

Applications

Intermediate Representation

Analyze Program(PL)

Train Model(ML)

Query Model(ML)

Page 8: Programming with “Big Code” - GitHub Pages · Predicting Program Properties from “Big Code”, POPL 2015 Fast and Precise Statistical Code Completion, ETH TR Statistical Feedback

What is a generic metric for code?Applications

Intermediate Representation

Analyze Program(PL)

Train Model(ML)

Query Model(ML)

✔ Cross Entropy → ✗ Code Completion✔ BLEU Score → ✗ Program Translation

Probabilistic Programming Systems: Dimensions

Traditional metrics might not be indicative of client performance

Page 9: Programming with “Big Code” - GitHub Pages · Predicting Program Properties from “Big Code”, POPL 2015 Fast and Precise Statistical Code Completion, ETH TR Statistical Feedback

What is the best program representation?Applications

Intermediate Representation

Analyze Program(PL)

Train Model(ML)

Query Model(ML)

Probabilistic Programming Systems: Dimensions

Page 10: Programming with “Big Code” - GitHub Pages · Predicting Program Properties from “Big Code”, POPL 2015 Fast and Precise Statistical Code Completion, ETH TR Statistical Feedback

What is the best program representation?Applications

Intermediate Representation

Analyze Program(PL)

Train Model(ML)

Query Model(ML)

Probabilistic Programming Systems: Dimensions

Sequences

req → {<open, 0>, <send, 0>}source → {..., <open, 2>}

=

a +

x y

Trees

Graphical Models Feature Vectors

req → (0,0,1,1,0)source → (1,0,0,0,0)...

Page 11: Programming with “Big Code” - GitHub Pages · Predicting Program Properties from “Big Code”, POPL 2015 Fast and Precise Statistical Code Completion, ETH TR Statistical Feedback

What is the best program representation?Applications

Intermediate Representation

Analyze Program(PL)

Train Model(ML)

Query Model(ML)

Probabilistic Programming Systems: Dimensions

Choosing the right representation is crucial

Feedback Generation: Sequence representations

Allamanis et. al. [2013] 46.4%

Hsiao et. al. [2014] 50.8%

Incorporate semantic information 75.3%

Incorporate dataflow analysis 86.3%

Page 12: Programming with “Big Code” - GitHub Pages · Predicting Program Properties from “Big Code”, POPL 2015 Fast and Precise Statistical Code Completion, ETH TR Statistical Feedback

Applications

Intermediate Representation

Analyze Program(PL)

Train Model(ML)

Query Model(ML)

How to extract program representation?SLANG (APIs): alias and typestate analysisJSNice (Variable Names): scope and alias analysisFeedback Generation: alias, control-flow and typestate analysis

Probabilistic Programming Systems: Dimensions

req.open("GET", source, false);

req → {<open, 0>, <send, 0>}source → {..., <open, 2>}

Page 13: Programming with “Big Code” - GitHub Pages · Predicting Program Properties from “Big Code”, POPL 2015 Fast and Precise Statistical Code Completion, ETH TR Statistical Feedback

Applications

Intermediate Representation

Analyze Program(PL)

Train Model(ML)

Query Model(ML)

How to extract program representation?SLANG (APIs): alias and typestate analysisJSNice (Variable Names): scope and alias analysisFeedback Generation: alias, control-flow and typestate analysis

Design scalable yet precise enough algorithms

Probabilistic Programming Systems: Dimensions

1

0.5

0

no alias analysiswith alias analysis

1% 10% 100%

[Precision vs % of data used]

Page 14: Programming with “Big Code” - GitHub Pages · Predicting Program Properties from “Big Code”, POPL 2015 Fast and Precise Statistical Code Completion, ETH TR Statistical Feedback

Applications

Intermediate Representation

Analyze Program(PL)

Train Model(ML)

Query Model(ML)

What is the suitable probabilistic model?N-gram language model

Probabilistic context-free grammarsNeural networks

(Structured) Support vector machineConditional Random Fields

...

Probabilistic Programming Systems: Dimensions

Page 15: Programming with “Big Code” - GitHub Pages · Predicting Program Properties from “Big Code”, POPL 2015 Fast and Precise Statistical Code Completion, ETH TR Statistical Feedback

Applications

Intermediate Representation

Analyze Program(PL)

Train Model(ML)

Query Model(ML)

What is the suitable probabilistic model?N-gram language model

Probabilistic context-free grammarsNeural networks

(Structured) Support vector machineConditional Random Fields

...

Probabilistic Programming Systems: Dimensions

Baseline 25.3%

Independent 54.1%

Structured 63.4%

Structured prediction is critical

Page 16: Programming with “Big Code” - GitHub Pages · Predicting Program Properties from “Big Code”, POPL 2015 Fast and Precise Statistical Code Completion, ETH TR Statistical Feedback

Programming with “Big Code”Applications

Intermediate Representation

Analyze Program(PL)

Train Model(ML)

Query Model

Code completionDeobfuscation

Program synthesis

Feedback generation

Translation

alias analysis

typestate analysis

Graphical Models

N-gram language modelSVM Structured SVM

Neural Networks

Sequences (sentences)

Trees

Translation TableFeature Vectors

control-flow analysis

scope analysis

argmax P(y | x)

y ∈ Ω

Page 17: Programming with “Big Code” - GitHub Pages · Predicting Program Properties from “Big Code”, POPL 2015 Fast and Precise Statistical Code Completion, ETH TR Statistical Feedback

Programming with “Big Code”Applications

Intermediate Representation

Analyze Program(PL)

Train Model(ML)

Query Model

Code completionDeobfuscation

Program synthesis

Feedback generation

Translation

alias analysis

typestate analysis

Graphical Models

N-gram language modelSVM Structured SVM

Neural Networks

Sequences (sentences)

Trees

Translation TableFeature Vectors

control-flow analysis

scope analysis

argmax P(y | x)

y ∈ Ω

Greedy MAP Inference

http://www.nice2predict.org/http://www.srl.inf.ethz.ch/spas.php More information and tutorials at:

Page 18: Programming with “Big Code” - GitHub Pages · Predicting Program Properties from “Big Code”, POPL 2015 Fast and Precise Statistical Code Completion, ETH TR Statistical Feedback

General framework

http://www.nice2predict.org/

We have open-sourced our prediction engine and we are extending it with new capabilities

Upcoming PLDI’15 tutorial

Page 19: Programming with “Big Code” - GitHub Pages · Predicting Program Properties from “Big Code”, POPL 2015 Fast and Precise Statistical Code Completion, ETH TR Statistical Feedback

Programming with “Big Code”Applications

Intermediate Representation

Analyze Program(PL)

Train Model(ML)

Query Model

Code completionDeobfuscation

Program synthesis

Feedback generation

Translation

alias analysis

typestate analysis

Graphical Models

N-gram language modelSVM Structured SVM

Neural Networks

Sequences (sentences)

Trees

Translation TableFeature Vectors

control-flow analysis

scope analysis

argmax P(y | x)

y ∈ Ω

Greedy MAP Inference

http://www.nice2predict.org/http://www.srl.inf.ethz.ch/spas.php More information and tutorials at: