new the “naturalness” of software: a research...

54
The “Naturalness” of Software: A Research Vision Abram Hindle, Earl Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu

Upload: others

Post on 26-Sep-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

The “Naturalness” of Software: A Research Vision

Abram Hindle, Earl Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu

Page 2: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

100%

Natural100%

Natural

public class FunctionCall {! public static void funct1 () {!! System.out.println ("Inside funct1");! }! public static void main (String[] args) {! int val;! System.out.println ("Inside main");! funct1();! System.out.println ("About to call funct2");! val = funct2(8);! System.out.println ("funct2 returned a value of " + val);! System.out.println ("About to call funct2 again");! val = funct2(-3);! System.out.println ("funct2 returned a value of " + val);! }! public static int funct2 (int param) {! System.out.println ("Inside funct2 with param " + param);! return param * 2;! }!}!!

100%

Natural

Page 3: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

English, Tamil, GermanCan be rich, powerful, expressive

Statistical Models

..but “in nature” are mostly simple, repetitive, boring

Page 4: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Can be rich, powerful, expressive

Mostly simple, repetitive, boring

Statistical Models

Page 5: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Two Examples

A speech recognizer example

Another speech recognizer example

“European Central Fish” ?

“fish++” ?

Page 6: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Repetition

Mathematical Models

Useful Software

Page 7: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Is software really repetitive?

Page 8: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

The “Uniqueness” of Code

Mark Gabel Zhendong Su

A study of the Uniqueness of Source Code, Gabel and Su, ACM SIGSOFT FSE 2010

Page 9: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

How Redundant is Code?

How much code? 6000 projects (C, C++, Java) 430,000,000 LOC

How long? Sequences of 6-77 token length

(1) How matched? Exact Match 1-4 edits

(2) How matched? Raw Tokens Renamed Identifiers

Page 10: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Non-Uniqueness (Redundancy) in a Large Java Corpus

Perc

ent

Red

unda

ncy

0

10

20

30

40

50

60

70

80

90

100

Length of Candidate Code Fragment in Tokens

5 20 35 50 65 80

Identifiers RenamedExact Tokens

Page 11: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Software is really repetitive.

How can we use this?

Page 12: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

How has the “naturalness”

(repetitive structure) of natural language

been exploited?

Page 13: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Large Corpora

Language Models

Speech Recognition, Translation, etc.

Page 14: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Language ModelsFor any utterance U, 0 p(U) 1

p(“EuropeanCentralF ish”) < p(“EuropeanCentralBank”)

If Ua is often uttered than U

b,

!p(Ua) > p(Ub)

p(for(i = 0; i < 10; fish + +)) < p(for(i = 0; i < 10; i + +))

Page 15: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

History of Language Models in NLP

• Initially, “Rationalist Methods” based on linguistic and logical theories...

• ...enter the “Empiricist” approach

✓Most “natural” utterances are repetitive, simple

✓Faster computers + large on-line corpora

➡ Good, high quality language models

➡ Rapid, revolutionary advances

“Every time I fire a linguist, the performance of our

speech recognizer goes up” —Fred Jelenik

Page 16: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Good Language Models have been used for:

➡Speech recognition

➡Natural language translation

➡Document summarization

➡Document retrieval

Language Models: a Revolution in NLPThe design and estimation

of language models is at the heart

of modern NLP

Page 17: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

But what about code? and

“code language models”?

Page 18: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Exploiting Code Language Models

Suggest the next token for developers

Complete the current token for developers

Assistive (speech, gesture) coding

Summarization and retrieval as translation

Fast, “good guess” static analysis

Search-based Software Engineering

Page 19: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Building a Language Model Large Text Corpus

(Training) Statistical

Model Design

EstimationAlgorithm

Model

EvaluationModel Quality

Large Text Corpus(Test)

Estimated using frequency of occurrence!

Page 20: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

What a Language Model Does

LanguageModel

..of the European Central Bank

p(of) p(the) p(European) p(Central) p(Bank)p(Bank)p(European) p(Central)

Page 21: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Language Models

Vastly more complex

Novel, NLP-specific estimation methods

Almost always face data-sparsity

Page 22: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Evaluating Language Model Quality

The words it encounters are not “too surprising” to it.

Frequently encountered language events are assigned higher probability

Infrequent language events are assigned lower probability.

....measured using “Cross-Entropy”

Page 23: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Background Cross Entropy

LanguageModel

public class FunctionCall {! public static void funct1 () {!! System.out.println ("Inside funct1");! }! public static void main (String[] args) {! int val;! System.out.println ("Inside main");! funct1();! System.out.println ("About to call funct2");! val = funct2(8);! System.out.println ("funct2 returned a value of " + val);! System.out.println ("About to call funct2 again");! val = funct2(-3);! System.out.println ("funct2 returned a value of " + val);! }! public static int funct2 (int param) {! System.out.println ("Inside funct2 with param " + param);! return param * 2;! }!}!!

Good Description?

Page 24: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Background Cross Entropy

LanguageModel

public class FunctionCall {! public static void funct1 () {!! System.out.println ("Inside funct1");! }! public static void main (String[] args) {! int val;! System.out.println ("Inside main");! funct1();! System.out.println ("About to call funct2");! val = funct2(8);! System.out.println ("funct2 returned a value of " + val);! System.out.println ("About to call funct2 again");! val = funct2(-3);! System.out.println ("funct2 returned a value of " + val);! }! public static int funct2 (int param) {! System.out.println ("Inside funct2 with param " + param);! return param * 2;! }!}!!

Good Description?

Low Cross Entropy!!

Page 25: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Background Cross Entropy

LanguageModel

public class FunctionCall {! public static void funct1 () {!! System.out.println ("Inside funct1");! }! public static void main (String[] args) {! int val;! System.out.println ("Inside main");! funct1();! System.out.println ("About to call funct2");! val = funct2(8);! System.out.println ("funct2 returned a value of " + val);! System.out.println ("About to call funct2 again");! val = funct2(-3);! System.out.println ("funct2 returned a value of " + val);! }! public static int funct2 (int param) {! System.out.println ("Inside funct2 with param " + param);! return param * 2;! }!}!!

Good Description?

High Cross Entropy!!

Page 26: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Cross entropy

Measuring Goodness: Cross entropy

1

n

nX

i=1

�log(p(ei))

Lower if Model assigns

High-Probability to frequent events

Higher if Model assigns

Low-Probability to frequent events

For a document with

n words

probability assigned by

Model to word

Page 27: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

What language model

gives low cross-entropy?

Page 28: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

n-gram models

• Intuition: Local Context Helps

• Examples (NL, then code)

• multiple choice question

• item = item→next

!

What is This?

What is This?

More context helps more!

Page 29: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

n-gram models of code: Experimental Results

Page 30: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Java Datasets

Page 31: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

C Datasets

Page 32: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

N-gram Cross Entropy

0

2.5

5

7.5

10

1-gram 2-gram 3-gram 4-gram 5-gram 6-gram 7-gram 8-gram

English Code

Five Bits!

3-4 Bits!

Page 33: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

The Skeptic Asks...Is it just that C, Java, Python... are simpler than English?

➡ Do cross-project testing!

➡ Train on one project, test on the others

➡ If it’s all “in the language”, entropy should be similar

Page 34: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Train on one project, test on the others.

Page 35: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3 Xalan−J Xerces2

24

68

10

12

14

Corpus Projects

Cro

ss E

ntr

opy

Self Cross Entropy

Page 36: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Train on one Ubuntu application domain,

test on the others.

Page 37: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Admin Doc Graphics Interpreters Mail Net Sound Tex Text Web

2.5

3.0

3.5

4.0

4.5

5.0

5.5

6.0

Corpus Categories

Cro

ss E

ntr

opy

Self Cross Entropy

116

22 21

23 1586 26

135 118

31

Page 38: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Suggest the next token for developers

Complete the current token for developers

Assistive (speech, gesture) coding

Summarization and retrieval as translation

Stupid, statistical, static analysis

Search-based Software Engineering

Suggest the next token for developers

The “Naturalness” Vision

Page 39: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Suggesting Tokens

What token could appear here?

What token has most often appeared here?

Uses Type, Scope,

Etc !

Use just previous

two tokens!

Page 40: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Do n-grams help?Eclipse

Suggestion Engine

Suggestion from

Language Model

MergeAlgorithm

Evaluation

Additional Benefit.

from Language

Model

Test Set (existing code)

Additional Benefit from

Language Model

Page 41: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

How many more correct suggestions?

Suggestion1

Suggestion2

Suggestion1

Suggestion2

Suggestion3

Suggestion4

Suggestion5

Suggestion6

Suggestion1

Suggestion2

Suggestion3

Suggestion4

Suggestion5

Suggestion6

Suggestion7

Suggestion8

Suggestion9

Suggestion10

Language Models ALWAYS

improve performance

more

Page 42: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

● ● ●

● ●● ●

0

20

40

60

80

100

120

Pe

rce

nt

Ga

in o

ver

Ecl

ipse

Raw

Ga

in (

cou

nt)

0

1000

2000

3000

4000

3 4 5 6 7 8 9 10 11 12 13 14 15

Suggestion Length

● Percent GainRaw Gain (count)

Suggestion1

Suggestion2Improved performance at every token length

Page 43: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

N-Gram suggestions

always add value to

the native Eclipse suggestion engine,

in a very large trial.

Page 44: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Can be rich, powerful, expressive

Mostly simple, repetitive, boring

Statistical Models

Page 45: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Suggest the next token for developers

Complete the current token for developers

Assistive (speech, gesture) coding

Summarization and retrieval as translation

Fast, “good guess” static analysis

Search-based Software Engineering

Suggest the next token for developers

The “Naturalness” Vision

????? ?????

Page 46: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Assisted Coding

Eclipse

Dasher++

Rachel Aurand

(Graduate Student)

Page 47: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3
Page 48: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

The “Naturalness” Vision

Suggest the next token for developers

Complete the current token for developers

Assistive (speech, gesture) coding

Summarization and retrieval as translation

Fast, “good guess” static analysis

Search-based Software Engineering

Page 49: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Noisy Channel Model

p(E | F ) =p(F | E).p(E)

p(F )

“Comment allez vous? ”What was the most likely English

sentence he was trying to say?

He’s trying to speak English, but it is systematically “messed up” into French.

Oh, it must be:

“Fine, thank you, How are you?”

!

May be he’s saying: “Do you comment all your code?”

Page 50: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Most Likely English Sentence

Most Likely way

“it got messed up”

Maximize Numerator over “E” to get best translation

Normalizing Constant

p(E | F ) =p(F | E).p(E)

p(F )

Page 51: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

English LanguageModel

Joint Distribution from Aligned Corpus

Normalizing Constant

p(E | F ) =p(F | E).p(E)

p(F )

Where do the probability distributions

come from?

Page 52: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

Noisy Channel ModelToast.makeText(context,

“hello”, 5).show(); What was the most likely English summary of this code?

He’s trying to speak English, but it comes out funny-sounding

Oh, it must be:

“Pop up a message window!”

Maybe his code means “Make me some toast?”

p(E | C) =p(C | E).p(E)

p(C)

Page 53: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

“Domain-Specific”English Language

Model

Code-English Joint Corpus

Normalizing Constant

Where do the probability distributions

come from?

p(E | C) =p(C | E).p(E)

p(C)

Page 54: New The “Naturalness” of Software: A Research Visioncrest.cs.ucl.ac.uk/cow/29/slides/COW29_Barr.pdf · 2013. 12. 10. · Ant Batik Cassandra Eclipse Log4j Lucene Maven2 Maven3

The “Naturalness” Vision

Suggest or Complete next tokens

Assistive (speech, gesture) coding

Summarization and Retrieval as Translation

Learn and Enforce Coding Conventions

Syntax Errors

Machine Translation for Porting

Fast, “good guess” static analysis

Search-based Software Engineering