create smart software engineering tools through machine ... · machine learning models of aspects...

97
Understanding Source Code through Machine Learning to Create Smart Software Engineering Tools Miltos Allamanis, University of Edinburgh March 13th, 2016 My PhD is supported by Joint work with: Charles Sutton (UoE), Earl T. Barr (UCL), Chris Bird (MSR), Daniel Tarlow (MSRC), Yi Wei (MSRC), Andrew D. Gordon (MSRC)

Upload: others

Post on 21-Aug-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Understanding Source Code through Machine Learning to Create Smart Software Engineering Tools

Miltos Allamanis, University of EdinburghMarch 13th, 2016

My PhD is supported byJoint work with: Charles Sutton (UoE), Earl T. Barr (UCL), Chris Bird (MSR), Daniel Tarlow (MSRC), Yi Wei (MSRC), Andrew D. Gordon (MSRC)

Page 2: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

internal & external

codebases

Mine the hidden knowledge to create smart software engineering tools.

Developers implicitly embed knowledge in code that may be useful for the same or other projects.

Page 3: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Machine Learning Models of

Source Code

Software Engineer

Page 4: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

A Spectrum of Problems for Machine Learning

Clustering

Supervised Unsupervised

Page 5: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

A Spectrum of Problems for Machine Learning

Joint Classification Learning Features

Page 6: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Natural Language Processing with Machine Learning

❯ Resolve language ambiguities with principled probabilistic models of language.

❯ Learn model parameters from annotated corpora.

Page 7: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Natural Language Processing (NLP)

Use Machine Learning to model aspects of a natural language.

Models of Aspects of Natural Language

Data: Corpora of Text, Speech etc

Parsing

Named Entity Recognition

Machine Translation

....

Some Knowledge of Linguistics

Page 8: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Machine Learning Models of

Source Code

“All models are wrong, some are useful” - George Box

Machine Learning Models of Aspects of Source Code

CodebasesSoftware Engineers

Software Engineering Tools

Page 9: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Language Modelsfor Source Code

Assign a non-zero probability to every piece of valid code

Probabilities learned from training corpus

Page 10: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Language Models of Source Code – Design Choices

for (int i = 0; i < 10; i++){ Console.WriteLine(i);}

ForStatement

Initialization Expression Expression Body

Single Variable Declaration

Type

int

Name Initializer

NumericLiteral

i

0

Infix Expression

Left Operand

Right Operand

Operator

< NumericLiteral

10

i

Syntactic Models

Token-level Models

Page 11: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

N-gram Language Models

Parameters of ML Model

e.g. P(0 | “for (int i =”)

Page 12: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

How n-gram models see code?package org.cfeclipse.cfml.snippets;

import org.rioproject.examples.logicdesigner.model.getState ( ) { cdl.Choreography;import org.apache.thrift.protocol.TProtocolUtil.skip(iprot);event.newLineCount == 3 ) { case '|' : if ( rule.FireAllRulesCommand;

import org.apache.hadoop.conf.get(0, 0, newByteBuffer, 0, count); } switch ( classifierID ) {

pd.getName() {cBondNeighborsB.get(MODULE).declaringType

= (DEREnumerated) {

jobEntryName.getText("//td[2]/a", RuntimeVariables.replace("//div[@class='lfr-component lfr-menu-list']/ul/li[1]/a" )); } }

Page 13: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Machine Learning

Machine Learning Model

Model Parameters

Learn the parameters of the model from data.

Handle uncertainty and noise.

Designed by humans Learned from data

Page 14: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Learning Model Parameters

Image from marple.eeb.uconn.edu

❯ Optimize objective function in training set

❯ Use computational methods of optimization

Page 15: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Finding a good model

image from http://antianti.org/?p=175

Underfitting Overfitting

Page 16: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Automatic Evaluation in Machine Learning

Imperfect measures of performance such as

❯ Prediction Accuracy ❯ Model Fit

❯ Quantify performance in a reproducible manner

❯ Drive improvement of systems in a measurable way

Page 17: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Source Code and Machine Learning

Coding PatternsMine & exploit common patterns[Hindle et al. 2012,Allamanis & Sutton 2014, Allamanis et al. 2014, 2015]

Probabilistic Static AnalysesProbability Distribution of (Formal) Properties[Raychev et al. 2015,Mangal et al. 2015]

Formal MethodsProbabilities over Search Space (e.g. Synthesis)[Ellis et al. 2015]

Runtime TracesInfer Program Properties from Traces[Brockschmidt et al. 2014Yujia Li et al. 2015]

Code & TextCode search, NL to Code[Yusuke, et al. 2015,Movshovitz-Attias & Cohen, 2013,Allamanis et al. 2014]

Page 18: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Outline

Learning Naming Conventions❯ Lexical Patterns

Learning to Map Natural Language to Source Code❯ Syntactic Patterns

Page 19: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Learning Naming Conventions

“Programs must be written for people to read, and only incidentally for machines to execute.”

- Abelson & Sussman, SICP, preface to the first edition

Page 20: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

A coding convention is a syntactic constraint beyond those imposed by the language grammar.

Allamanis et. al, FSE 2014, FSE 2015ACM Distinguished Paper Award

Page 21: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

The Importance of Coding Conventions

Based on 169 code reviews with 1,093 discussion threads in Microsoft.

Code Review Discussions

Conventions 38%

Naming 24%

Formatting 9%

[Allamanis et al. FSE 2014]

Page 22: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

The Importance of Coding Conventions

Based on 169 code reviews with 1,093 discussion threads in Microsoft.

Code Review Discussions

Conventions 38%

Naming 24%

Formatting 9%

[Allamanis et al. FSE 2014]

Page 23: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Arnaoudova, Venera, L. Eshkevari, Massimiliano Di Penta, Rocco Oliveto, Giuliano Antoniol, and Y. Gueheneuc. "REPENT: Analyzing the nature of identifier renamings." (2014)

94 developers

Is recommending identifier renamings useful?

Page 24: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

A Machine Learning

Perspective

A name reflects important aspects of code functionality.

Learning to name source code elements is a first step in understanding code through machine learning.

Page 25: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Suggestions for junit/src/test/java/junit/tests/runner/TextRunnerTest.java

public class TextRunnerTest extends TestCase {

void execTest(String testClass, boolean success) throws Exception {

...

InputStream i = p.getInputStream();

while ((i.read()) != -1);

...

}

...

}

Page 26: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Suggestions for junit/src/test/java/junit/tests/runner/TextRunnerTest.java

public class TextRunnerTest extends TestCase {

void execTest(String testClass, boolean success) throws Exception {

...

InputStream i = p.getInputStream();

while ((i.read()) != -1);

...

}

...

}

Source CodeLanguage Model

automatically suggest renamings

Page 27: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Suggestions for junit/src/test/java/junit/tests/runner/TextRunnerTest.java

public class TextRunnerTest extends TestCase {

void execTest(String testClass, boolean success) throws Exception {

...

InputStream i = p.getInputStream();

while ((i.read()) != -1);

...

}

...

}

Source CodeLanguage Model

Score by naturalness

&Threshold

1.'i' (18.07%) -> {input(81.93%), }

automatically suggest renamings

Page 28: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Suggesting Names to Developers: The Naturalize Framework

[Allamanis et al. FSE 2014, FSE 2015]

ML model of code

Page 29: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Naturalize Tools - devstyle

devstyle suggests identifier renamings

Page 30: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

18 patches for 5 well known open source projects: 14 accepted, 4 ignored

Page 31: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

libgdx Java Game Development Framework

Method Naming Problem

Page 32: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Method Naming Problem

Names describe what it does not what it isModels need to be “non-local”

Page 33: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

• create • create?UNK? • init • createShaderSuggestions:

Method Naming Problem

Page 34: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Method Naming Problem

• create • create?UNK? • init • createShaderSuggestions:

Page 35: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

A Machine Learning Model of Names

[Mnih, Hinton, 2007; Mnih, Teh, 2012; Mnih, Kavukcuoglu, 2013; Maddison, Tarlow, 2014]

Page 36: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Embedding Identifiers

are “embeddings” ::: model parameters

[Mnih, Hinton, 2007; Mnih, Teh, 2012; Mnih, Kavukcuoglu, 2013; Maddison, Tarlow, 2014]

Page 37: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Embedding Identifiers

[Mnih, Hinton, 2007; Mnih, Teh, 2012; Mnih, Kavukcuoglu, 2013; Maddison, Tarlow, 2014]

Page 38: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Neural Context Models of Source Code

Page 39: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Neural Context Models of Source Code

Page 40: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Neural Context Models of Source Code

Page 41: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Neural Context Models of Source Code

Page 42: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Neural Context Models of Source Code

Page 43: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Neural Context Models of Source Code

Page 44: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Neural Context Models of Source Code

Page 45: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Neural Context Models of Source Code

Global Information

Page 46: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Neural Context Models of Source Code

Local Information

Page 47: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Neural Context Models of Source Code

Page 48: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Embedding Identifiers

[Mnih, Hinton, 2007; Mnih, Teh, 2012; Mnih, Kavukcuoglu, 2013; Maddison, Tarlow, 2014]

Page 49: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Neologisms

Page 50: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

getInputStream get Input Stream

Subtoken Context Models of Code

Sequentially predict each subtoken given the context and the previous

subtokens

Page 51: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Training Data(project)

Train Neural Network

Suggest Names on Test Data

Embeddings

Page 52: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Evaluation Methodology

Suggestions

1. job (30%)2. task (20%)3. tsk (15%)

ForkJoinTask<?> job;if (task instanceof ForkJoinTask<?>) // avoid re-wrap job = (ForkJoinTask<?>) task;else job = new ForkJoinTask.AdaptedRunnableAction(task);externalPush(job);

Test File

Evaluation on top 10 Java GitHub projects.

Perturb existing code and retrieve ground truth.

Page 53: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Evaluation Methodology

compare with

ground truth

Suggestions

1. job (30%)2. task (20%)3. tsk (15%)

ForkJoinTask<?> job;if (task instanceof ForkJoinTask<?>) // avoid re-wrap job = (ForkJoinTask<?>) task;else job = new ForkJoinTask.AdaptedRunnableAction(task);externalPush(job);

ForkJoinTask<?> job;if (task instanceof ForkJoinTask<?>) // avoid re-wrap job = (ForkJoinTask<?>) task;else job = new ForkJoinTask.AdaptedRunnableAction(task);externalPush(job);

Test File

Evaluation on top 10 Java GitHub projects.

Perturb existing code and retrieve ground truth.

Page 54: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Suggesting Variable Names

Page 55: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Suggesting Variable Names

Page 56: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Suggesting Method Names

Page 57: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Suggesting Method Names

Page 58: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Embedding Visualization http://groups.inf.ed.ac.uk/cup/naturalize

Page 59: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Embedding Visualization http://groups.inf.ed.ac.uk/cup/naturalize

Page 60: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Embedding Visualization http://groups.inf.ed.ac.uk/cup/naturalize

Page 61: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Learning to map natural language to source code

Work done in Microsoft Research - Cambridge

Joint work with Danny Tarlow, Yi Wei, Andy Gordon

Page 62: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Applications of Joint Models of Code & NL

Code Retrieval

NL Retrieval for Source Code

and eventually code synthesis...

Page 63: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

A Conditional Generative Model

“get the first letter of each word in string and uppercase”

NL QueryConditional Generative

Model of Source Code

Synthesize/Score Code Snippet

string s;string[] words = s.ToUpper().split(‘ ‘);string[] firstLetters = new string[words.Length];for (int i=0; i < words.Length; i++) {

firstLetters[i] = words.Substring(0,1);}

Page 64: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

ForStatement

Initialization Expression Expression Body

Single Variable Declaration

Type

int

Name Initializer

NumericLiteral

i

0

Infix Expression

Left Operand

Right Operand

Operator

< NumericLiteral

10

i

Syntactic model of source code, i.e. model how AST is generated

Page 65: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Tree Generation Model:Context Free Grammars (CFG)

n

c

Page 66: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

n

c

Tree Generation Model:Probabilistic Context Free Grammars (PCFG)

Page 67: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Generating from a PCFGForStatement

Page 68: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Generating from a PCFGForStatement

Initialization Expression Expression Body

Page 69: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Generating from a PCFGForStatement

Initialization Expression Expression Body

Single Variable Declaration

Page 70: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Generating from a PCFGForStatement

Initialization Expression Expression Body

Single Variable Declaration

Type Name Initializer

Page 71: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Generating from a PCFGForStatement

Initialization Expression Expression Body

Single Variable Declaration

Type

int

Name Initializer

Page 72: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Generating from a PCFGForStatement

Initialization Expression Expression Body

Single Variable Declaration

Type

int

Name Initializer

NumericLiteral

i

0

Infix Expression

Left Operand

Right Operand

Operator

< NumericLiteral

10

i

Page 73: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Conditional Generative

Modelof Source Code

Given natural language, get a model that can generate (probabilistically) source code, i.e.

P(code | natural language)

Page 74: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

A Neural Log-Bilinear Bimodal Model of Code

Kiros, Ryan, Ruslan Salakhutdinov, and Rich Zemel. "Multimodal neural language models."

Maddison, Chris and Daniel Tarlow. "Structured generative models of natural source code."

Page 75: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

A Neural Log-Bilinear Bimodal Model of Code

Kiros, Ryan, Ruslan Salakhutdinov, and Rich Zemel. "Multimodal neural language models."

Maddison, Chris and Daniel Tarlow. "Structured generative models of natural source code."

Page 76: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

A Neural Log-Bilinear Bimodal Model of Code

Kiros, Ryan, Ruslan Salakhutdinov, and Rich Zemel. "Multimodal neural language models."

Maddison, Chris and Daniel Tarlow. "Structured generative models of natural source code."

Page 77: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

StackOverflow Data & Augmenting Data with Queries

Page 78: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Natural Language “Query”

Code Snippets

Page 79: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

30K C# Questions40K C# Snippets

Page 80: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

C# enum

C# foreach enum

C# enumerate enum

How to order enum values in C#

foreach enum C#

C# enumerate enumerations

C# enumeration

C# enumerator

C# foreach enum values

C# enumerate

http://stackoverflow.com/questions/105372/how-do-i-enumerate-an-enum

Page 81: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

40,092 C# Snippets6,355,393 Natural Language Queries

Page 82: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Performance Metric: Mean Reciprocal Rank

Measures how well we rank the correct answer

Page 83: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Model StackOverflow Test 1

StackOverflow Test 2

NL+Code 0.18 0.17

NL only 0.12 0.13

Retrieval Evaluation - MRR Performance

Model StackOverflow Test 1

StackOverflow Test 2

Multiplicative 0.43 0.41

NL only 0.25 0.26

Cod

e R

etrie

val

Que

ry

Ret

rieva

l

Test 1: Code snippets from training set with new natural language queries.Test 2: New code snippets and new natural language queries.

Page 84: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Synthesis Samples

> timespan day the weekDateTime DateTime=DateTime.Now(0);

> file exists on directoryvar path = new File(directory)

Page 85: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Synthesis Samples

> timespan day the weekDateTime DateTime=DateTime.Now(0);

> file exists on directoryvar path = new File(directory)

Page 86: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Retrieval Sample

Page 87: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Challenges

❯ Code Representations in Machine

Learning

❯ Define Representative Evaluation

Metrics for Software Engineering Tasks

❯ Create Useful and Efficient Software

Engineering Tools

Page 88: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Understanding Source Code through Machine Learning to Create Smart Software Engineering

Tools

Page 89: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

(end)

Page 90: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Learning Naming ConventionsAllamanis et al, 2014; 2015

Mining Source Code IdiomsAllamanis and Sutton, 2014Learning to name source code

n-gram LMs for CodeAllamanis et al, 2013

Page 91: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

The Sympathetic Uniqueness Principle

● Prune rare words● Repurpose special UNK token● Allows Naturalize to decide when it should

not suggest

public void execute(Runnable task) { if (task == null) throw new NullPointerException(); ForkJoinTask<?> job; if (task instanceof ForkJoinTask<?>) // avoid re-wrap job = (ForkJoinTask<?>) task; else job = new ForkJoinTask.AdaptedRunnableAction(task); externalPush(job); }

Rare names often usefully signify unusual functionality, and need to be preserved.

Page 92: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Idioms vs. the RestCode Clones copy-paste code fragmentsC. K. Roy et al. Comparison and evaluation of code clone detection techniques and tools: A

qualitative approach. Science of Computer Programming, 2009.L. Jiang et al. Deckard: Scalable and accurate tree-based detection of code clones. ICSE 2007.H. A. Basit and S. Jarzabek. A data mining approach for detecting higher-level clones in

software. IEEE Transactions on Software Engineering, 2009.

API Patterns usage patterns of methodsT. T. Nguyen et al. Graph-based mining of multiple object usage patterns. ESEC/FSE 2009.J. Wang et al. Mining succinct and high-coverage API usage patterns from source code. MSR

2013.H. Zhong et al. MAPO: Mining and recommending API usage patterns. ECOOP, 2009.

Idioms syntactic code fragments

Page 93: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

The Distributional Hypothesis

“You shall know a word by the company it keeps”.

John Rupert Firth, 1957

Page 94: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

The Distributional Hypothesis

“You shall know a word by the company it keeps”.

John Rupert Firth, 1957

The ????????? is walking

Page 95: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

For IWESEP, in a way I'm inviting you as a ML/NLP person that can teach software engineering people the cool things you can do with ML/NLP techniques. I'm not sure if you see yourself this way, but I think a quick "intro to ML/NLP" for the first 1/3 or so, then "look at all the cool things you can do" for the second 2/3 would be one potential way to give the presentation.

Page 96: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

Sanity Check:

String Manipulation Synthetic Data

var result = input_string.Split(' ').Select((string x) => Double.parse(x))).Average();

each element parse double separated by a space and get meaneach element parse double separated by a space and get averageeach element convert to double separated by a space and get meaneach element convert to double separated by a space and get averageeach element parse to double separated by a space and get mean

Page 97: Create Smart Software Engineering Tools through Machine ... · Machine Learning Models of Aspects of Source Code Software Engineers Codebases Software Engineering Tools. Language

A Neural Log-Bilinear Bimodal Model of Code

Kiros, Ryan, Ruslan Salakhutdinov, and Rich Zemel. "Multimodal neural language models."

Maddison, Chris and Daniel Tarlow. "Structured generative models of natural source code."