
0098-5589 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSE.2017.2750682, IEEE Transactions on Software Engineering

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. X, NO. X, XX 20XX

Expanding Queries for Code Search Using Semantically Related API Class-names

Feng Zhang, Haoran Niu, Iman Keivanloo, Member, IEEE, and Ying Zou, Member, IEEE

Abstract—When encountering unfamiliar programming tasks (e.g., connecting to a database), there is a need to seek potential working code examples. Instead of using code search engines, software developers usually post related programming questions on online Q&A forums (e.g., Stack Overflow). One possible reason is that existing code search engines return effective code examples only if a query contains identifiers (e.g., class or method names). In other words, existing code search engines do not handle natural-language queries (e.g., a description of a programming task) well. However, developers may not know the appropriate identifiers at the time of the search. As the demand for searching code examples increases, it is of significant interest to enhance code search engines. We conjecture that expanding natural-language queries with their semantically related identifiers has great potential to enhance code search engines. In this paper, we propose an automated approach to find identifiers (in particular, API class-names) that are semantically related to a given natural-language query. We evaluate the effectiveness of our approach using 74 queries on a corpus of 23,677,216 code snippets extracted from 24,666 open source Java projects. The results show that our approach can effectively recommend semantically related API class-names to expand the original natural-language queries. For instance, our approach successfully retrieves relevant code examples in the top 10 retrieved results for 76% of the 74 queries, compared with 36% when using the original natural-language queries; and the median rank of the first relevant code example is improved from 22 to 7.

Index Terms—Query Expansion, Code Search, Neural Network Language Model, API Class-name


1 INTRODUCTION

DURING the process of software development, developers need to accomplish different programming tasks, such as connecting to a database and uploading a file to a server. Many approaches (e.g., [6, 16, 35]) have been proposed to help developers search relevant source code. For example, GitHub and Koders are two code search engines that return source code snippets for given keywords. The keywords are usually identifiers, such as class and method names. Recent work by Stolee et al. [33] describes a new search model beyond keyword-based engines, i.e., letting developers specify the behaviour of the target code, such as defining the expected input and output.

The programming tasks encountered by one developer might have already been encountered by others [11]. Developers often seek help from other developers for potential solutions (e.g., code examples). A code example is a snippet of source code that has a few lines of code and is usually used to show how a programming task is implemented [19].

• F. Zhang is with the School of Computing, Queen’s University, Kingston, Canada. E-mail: [email protected].

• H. Niu, I. Keivanloo and Y. Zou are with the Department of Electrical and Computer Engineering, Queen’s University, Kingston, Canada. E-mail: {13hn, ying.zou}@queensu.ca, and [email protected].

To get a potential solution from a large community of developers, developers often ask programming questions on online Q&A forums (e.g., Stack Overflow, hereafter SO) [11]. Such programming questions are presented in the form of natural language. For instance, when a developer looks for code examples that show the implementation of playing sound, he or she may ask the following question: “How can I play sound using Clip in Java?” (see Fig. 1). The increasing popularity of online Q&A forums indicates the increasing need for searching code examples. However, existing code search engines mainly support keyword-based queries that consist of a set of search terms (e.g., class or method names). Therefore, we think there is a significant interest in enhancing existing code search engines by expanding natural-language queries with identifiers (e.g., class and method names). In addition, an improved code search engine can help companies to better manage code reuse and share knowledge in large scale proprietary software systems (e.g., a car typically has about 100 million lines of code [27]).

Our preliminary study (see Section 2) shows that the proportion of identifiers plays a key role in retrieving relevant code examples from code search engines. As shown in the example in Fig. 1, after adding five identifiers (i.e., “Player”, “AudioInputStream”, “LineUnavailableException”, “Game”, and “AudioClip”) to the initial query, a code example that better matches the initial query is returned. There are various approaches to find terms (e.g., identifiers) for query expansion in the software engineering area, and they can be grouped into two categories [14]: 1) semi-automated approaches that extract terms (e.g., identifiers) only from the answers that have been marked by developers


Initial query: How can I play sound using Clip in Java?

⇓ The first code example returned for the initial query:

private static boolean playSound(String soundUrl)
        throws SlickException {
    boolean resourceFound = true;
    Sound sound;
    try {
        sound = new Sound(soundUrl);
    } catch (SlickException e) {
        System.out.println("Did not find the following resource: " + soundUrl);
        sound = new Sound("sound/error.ogg");
        resourceFound = false;
    }
    sound.play();
    while (sound.playing()) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
    return resourceFound;
}

Expanded query: How can I play sound using Clip in Java? + Player AudioInputStream LineUnavailableException Game AudioClip

⇓ The first code example returned for the expanded query:

public static void playAudioClip(String clipFilename) {
    Clip m_clip;
    AudioInputStream audioInputStream = null;
    try {
        audioInputStream = AudioSystem.getAudioInputStream(
                new FileInputStream(clipFilename));
        if (audioInputStream != null) {
            AudioFormat format = audioInputStream.getFormat();
            DataLine.Info info = new DataLine.Info(Clip.class, format);
            m_clip = (Clip) AudioSystem.getLine(info);
            m_clip.open(audioInputStream);
            m_clip.start();
        } else {
            System.out.println("ClipPlayer.<init>(): can't get data from file " + clipFilename);
        }
    } catch (LineUnavailableException lineException) {
        Toolkit.getDefaultToolkit().beep();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Fig. 1. The first code example returned for the initial (left) and the expanded (right) queries, respectively.

as relevant to the initial query; and 2) automated approaches that use natural language processing (NLP) models trained from general English text to identify terms that are semantically related to the initial query [43].

In this paper, we propose an automated approach that recommends semantically related API class-names to expand a natural-language query through a neural network language model. We focus on API class-names because applications with similar functionalities often use the same set of API classes. The approach is based on a continuous bag-of-words model [25] that can learn the semantic representations of API class-names from the source code. We compute the semantic distances between the terms of the initial query and API class-names, and select the top semantically related API class-names to expand the initial query.

Our major contributions are presented as follows:

• Conduct a preliminary study using a representative sample of 384 SO questions. We observe that code search related questions are prevalent in SO, and such questions are described in natural language with a small proportion of identifiers (e.g., class names).

• Propose an automated approach to expand natural-language code search queries with semantically related API class-names, using a neural network language model.

• Evaluate our approach using a corpus of 23,677,216 code snippets. The results show that

1. API (application program interface) is a set of common classes and methods that are generally used to create software applications.

our approach can significantly enhance code search engines on natural-language queries. Specifically, the median position of the first correct answer in the returned list is lifted from 22 to 2 for the 74 studied queries. The improvement is desirable since developers typically only examine the top results.

Paper organization. Section 2 shows our motivating studies. Section 3 describes the details of our approach. The case study design, results, and the threats to validity of our study are presented in Sections 4, 5, and 6, respectively. We discuss related work in Section 7, and conclude in Section 8.

2 MOTIVATING STUDY

In this section, we perform a preliminary study to find whether it is desirable to support natural-language queries in code search engines and to explore a possible way to expand natural-language queries.

RQ1. Is it common to seek code examples by posting descriptive questions on Stack Overflow?

Motivation. As code search engines do not support natural-language queries well, developers often seek help from fellow peers by posting descriptive questions on online Q&A forums (e.g., SO). In this research question, we classify questions posted on SO into various categories and quantify the number of questions related to code search. The prevalence of code search questions could indicate that it is a common practice for developers to search for code examples with only descriptive questions on SO, and therefore


it is desirable to support natural-language queries in code search engines.

Approach. To address this preliminary question, we restrict our analysis to questions related to “Java”, which is the most popular object-oriented programming language [37]. We extract 725,492 questions from Stack Overflow that were tagged with "Java" and were posted between August 2008 and October 2014. To reflect the distribution of the 725,492 SO questions at the 95%±5% confidence level, a sample of 384 questions is needed [3, 7]. Therefore, we randomly select 384 questions from the entire population to create a representative sample.
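For reference, this sample size follows from the standard formula for estimating a proportion at the 95% confidence level with a ±5% margin of error, assuming maximum variability (p = 0.5):

n = z^2 · p(1 − p) / e^2 = 1.96^2 × 0.5 × 0.5 / 0.05^2 ≈ 384.2, i.e., 384 questions.

(For a population of 725,492 questions, the finite-population correction changes this value only negligibly.)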

We manually review and classify the sampled questions. Inspired by the types of questions identified by Stolee et al. [33], we categorize questions into the following six categories:

(C1) Code search that contains questions on how to fulfil a programming task.

(C2) Concept explanation that includes questions regarding programming concepts or techniques.

(C3) Debugging that has questions dealing with run-time errors or unexpected exceptions.

(C4) Procedure that contains questions seeking a scenario to accomplish a certain task that is not related to programming.

(C5) Suggestion that includes questions looking for the availability of certain tools.

(C6) Discussion that has questions discussing performance, programming languages, etc.

Finally, we compute the proportion of the questions belonging to each of the six categories. The proportion can reflect the popularity of questions related to Code search on Stack Overflow.

Findings. The dominant category of questions on SO is Code search. The distribution of questions in each category is shown in Table 1. We observe that 31% of the sampled questions belong to the category Code search. This observation indicates that SO is frequently used by developers when seeking solutions for a particular programming task. However, questions posted on forums like SO are answered by peer developers. It is not guaranteed that each question can be correctly answered in a timely manner. On the other hand, code search engines instantly return answers for a query. If using Code search questions as queries were well supported by code search engines, it could considerably relieve the programming burdens of developers.

We observe that Code search related questions describe a particular programming task in a natural-language way. For instance, to complete the task of subtracting three months from a given object of class “java.util.Date”, an example question is asked as: “java.util.Date - Deleting three months from a

TABLE 1
The number of sampled questions of each category.

Category of questions        Number of questions (%)
(C1) Code search             118 (31%)
(C2) Concept explanation      26 (7%)
(C3) Debugging                62 (16%)
(C4) Procedure               104 (27%)
(C5) Suggestion               39 (10%)
(C6) Discussion               35 (9%)

date?”. There is no consistent pattern of expressing Code search questions. For example, only 29% (i.e., 34 out of 118) of the Code search questions in the sampled 384 questions contain the word “how”. In the aforementioned example question, there is an identifier “java.util.Date”. However, we observe that identifiers do not necessarily appear in questions, such as “Converting alphanumeric to ascii and incrementing”.

Developers tend to prefer answers with code examples. We find that the accepted answers of Code search questions on SO usually contain code examples. Particularly, 62% (73 out of 118) of Code search questions contain code examples. This ratio is close to the observation by Stolee et al. [33], which is 63%.

To accomplish a particular programming task, it is a common practice of developers to seek code examples on SO by asking descriptive questions. Supporting the use of such questions as queries in code search engines potentially accelerates the process of solution finding.

RQ2. Does including identifiers help code search?

Motivation. By exploring Stack Overflow, we observe that Code search questions are usually composed of natural-language terms, and do not necessarily contain identifiers (i.e., class or method names). When using Code search questions as queries, our intuitive solution to improve the search efficiency is to extend the initial question with appropriate identifiers. In this question, we aim to examine whether our intuitive solution makes sense, i.e., whether including identifiers in queries can help find code examples.

Approach. To address this question, we analyze a large number of code search queries with and without identifiers. The queries are extracted from a year-long (i.e., from 2007-01-01 to 2007-12-31) usage log of Koders.com, which was a very popular code search engine with over three million unique visitors in 2007 [1]. The usage log data was collected by

2. http://stackoverflow.com/questions/1311143/java-util-date-deleting-three-months-from-a-date

3. http://stackoverflow.com/questions/1383132/converting-alphanumeric-to-ascii-and-incrementing

4. Koders (now “Open Hub”): http://code.openhub.net


Bajracharya and Lopes [1], and there are 628,862 code search queries in the data.

The usage log records two types of user activities: 1) a search activity represents a user’s query; and 2) a download activity happens if a user clicks one of the query results to view the code snippet. A download activity indicates that the user is interested in the clicked code snippet and uses the code snippet in some way.

We consider that a query is answered if at least one code snippet is clicked. The underlying assumption is that a user clicks a code snippet only when the user thinks the code snippet is relevant to his/her query. This assumption has been used in prior studies (e.g., [1, 17, 18]) on both code search and general purpose search for finding relevant results.

To determine if a query contains identifiers, we examine whether the query contains any of the three fields “mdef”, “cdef” and “idef” that are used by Koders.com to specify method, class and interface names in a query, respectively.

To evaluate whether including identifiers in code search queries impacts the efficiency of code search engines, we split the 628,862 Koders queries into two groups: 1) one group contains all queries having one or more identifiers; and 2) the other group contains all remaining queries. Then we test the following null hypothesis, using the Chi-squared test [13] with the 95% confidence level (i.e., p-value < 0.05):

H20: the proportion of answered queries does not differ between the two groups of queries with and without identifiers.

The Chi-squared test assesses if there is a relationship between two categorical variables [13]. If there is a significant difference, we reject the null hypothesis and further compute the odds ratio (OR) [30]. The odds ratio measures how strongly the presence of identifiers in queries is associated with the success of getting answers. OR = 1 indicates that the presence of identifiers does not affect the search efficiency; OR > 1 shows that the presence of identifiers increases the likelihood of having a query answered; and OR < 1 means that the presence of identifiers negatively affects the search efficiency.

Findings. Including identifiers (i.e., class or method names) in a query significantly increases the chance of retrieving relevant code examples. The p-value of the Chi-squared test is less than 2.2e-16, thus we reject the null hypothesis H20 and conclude that the proportion of answered queries is significantly different between the groups with and without identifiers. Table 2 presents the detailed number of queries.

The odds ratio is 2.23 (i.e., OR > 1). This indicates that queries with identifiers are 2.23 times more likely to retrieve relevant code examples from code search engines than queries without identifiers. The ratio of successful (i.e., answered) to unsuccessful (i.e., unanswered) queries is 32% among queries with

TABLE 2
The number of answered and unanswered queries in groups of queries with or without identifiers.

                              Answered (%)      Unanswered (%)      Answered/Unanswered
Queries with identifiers      5,090 (0.8%)      15,810 (2.5%)       32%
Queries without identifiers   76,847 (12.2%)    531,115 (84.5%)     14.5%

identifiers, which is much higher than the one among queries without identifiers (i.e., 14.5%). This observation motivates us to expand Code search questions with identifiers, towards helping developers find relevant code examples with natural-language queries.
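For illustration, the odds ratio of 2.23 reported above follows directly from the counts in Table 2:

OR = (5,090 / 15,810) / (76,847 / 531,115) ≈ 0.322 / 0.145 ≈ 2.23.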


Including identifiers can significantly improve the search efficiency.

Summary of the motivating study. When performing a particular programming task, a common way for developers to find relevant code examples is to post Code search questions on SO. However, questions on SO are not guaranteed to be answered immediately. To get instant answers (i.e., relevant code examples), it is desirable to use Code search questions as queries on code search engines.

Moreover, we observe that having identifiers in a query significantly improves the search efficiency. Considering that Code search questions are usually described in a natural-language way, extending the questions with identifiers may improve the efficiency of searching code examples with Code search questions. Therefore, we propose an approach to automatically expand natural-language queries with semantically related identifiers (particularly API class-names).

3 OUR APPROACH

In this section, we present the details of our approach for query expansion.

3.1 Overview

Fig. 2 depicts the overview of our approach. Our approach contains three major steps:

1) extracting vector representations that are used to map natural-language queries with identifiers;

2) applying the model to find relevant identifiers to expand the given query;

3) executing the expanded query to retrieve code snippet lists from the code snippet corpus.

3.2 Extracting Vector Representations

To map natural-language queries with identifiers, we extract vector representations of words by applying the continuous bag-of-words model (CBOW) [25].


Fig. 2. The process of expanding natural-language queries. (The figure depicts the pipeline: identifiers are extracted from a code snapshot to build the training data; the neural network language model is applied to produce the vector representation set; the vectors of the user query and the identifiers are used to extract relevant identifiers, which expand the query; the expanded query is applied in the code search engine, which returns the list of code examples from the code snippet corpus.)

Specifically, the CBOW model is used to find whether two words describe similar functionality of code.

The CBOW model is a neural network model that can infer semantically similar words from the context of words (i.e., their surrounding words). For instance, assume that the corpus contains (“the”, “girl”, “was”, “walking”) and (“the”, “boy”, “was”, “walking”). The CBOW model learns that “girl” and “boy” are semantically similar, as they have the same context (“the”, “was”, “walking”). More background on the neural natural-language model is presented in the appendix.

The details of extracting vector representations are described as follows.

1) Extracting identifiers. Identifiers include class and method names. Compared to user-defined classes or API methods, API classes are more likely to appear in many code examples. In other words, using API classes can retrieve more relevant code examples. Therefore, we choose to use only the API class-names as identifiers to expand the original natural-language queries. Specifically, we only use the classes that are described in the Java API documentation.

To extract API class-names from a Java file, we parse all “import” statements to get the class-names. An “import” statement (e.g., “import java.util.Arrays”) specifies the Java class that is used in a Java file, and comprises both the package name and the class name. The “import” statements that do not specify the Java class, such as “import java.util.*”, are excluded. We process all Java files to extract all API class-names that exist in our corpus. In total, 15,085 class-names are extracted, including 97% (i.e., 3,892 out of 4,024) of the Java API classes listed in the Java API documentation.
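As an illustration of this extraction step, a rough sketch is shown below; the regular expression and class names are our own, not the authors' implementation, and a further filter against the list of documented Java API classes is assumed to be applied afterwards:

import java.util.*;
import java.util.regex.*;

public class ImportExtractor {
    // Matches single-type imports such as "import java.util.Arrays;".
    // Wildcard imports such as "import java.util.*;" are deliberately not matched.
    private static final Pattern IMPORT =
            Pattern.compile("^\\s*import\\s+([\\w.]+)\\.(\\w+)\\s*;");

    // Returns the simple API class-names imported by one Java source file.
    public static Set<String> extractClassNames(List<String> javaFileLines) {
        Set<String> classNames = new HashSet<>();
        for (String line : javaFileLines) {
            Matcher m = IMPORT.matcher(line);
            if (m.find()) {
                classNames.add(m.group(2));  // the class name without its package prefix
            }
        }
        return classNames;
    }
}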

2) Creating the training data for the CBOW model. We tokenize each code snippet (i.e., a complete method) in the corpus through camel case splitting. Camel case splitting (e.g., “DescriptionAttribute” to “Description” and “Attribute”) has been used in prior studies [20, 31] to extract words from identifiers. As we propose to extend natural-language queries with API class-names, we keep the class-names as is. Therefore, a term is either a word or an intact class-name.

We normalize the tokenized terms by removing stop words and stemming. Our stop words include both common English stop-words (e.g., “a”, “the”, and “is”) and programming keywords (e.g., “if”, “else”, and “for”). We use the Porter stemmer for stemming the tokenized terms, such as converting the words “stems”, “stemmer”, “stemming” and “stemmed” to the word “stem”. All the normalized terms make up the corpus for training the CBOW model.
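A rough sketch of this tokenization and normalization step follows; the stop-word list is abbreviated, the stemmer is a placeholder for any Porter stemmer implementation, and intact API class-names are assumed to be set aside before splitting:

import java.util.*;

public class Normalizer {
    // Common English stop-words and Java keywords treated as stop-words (abbreviated lists).
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "the", "is", "if", "else", "for"));

    // Splits a camel-case identifier, e.g. "DescriptionAttribute" -> ["Description", "Attribute"].
    public static List<String> splitCamelCase(String token) {
        return Arrays.asList(token.split("(?<=[a-z0-9])(?=[A-Z])"));
    }

    // Lower-cases each word, removes stop-words, and stems the remaining words.
    public static List<String> normalize(List<String> words, Stemmer stemmer) {
        List<String> terms = new ArrayList<>();
        for (String word : words) {
            String lower = word.toLowerCase();
            if (!STOP_WORDS.contains(lower)) {
                terms.add(stemmer.stem(lower));  // e.g. a Porter stemmer maps "stemming" to "stem"
            }
        }
        return terms;
    }

    // Placeholder for a Porter stemmer implementation.
    public interface Stemmer {
        String stem(String word);
    }
}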

3) Extracting vector representations. We choose to use Word2vec, an efficient implementation of the CBOW model [25]. Word2vec takes the corpus of normalized terms as the training data to build the model, and produces a set of term vectors as the output.

The resulting term vectors include the vectors of all the terms that appear in the training data. Each element of a term vector is a float value, which reflects one semantic feature of the term.
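For illustration, assuming Word2vec is run with its plain-text output format (a header line with the vocabulary size and the number of features, then one term per line followed by its feature values), the resulting term vectors could be loaded as follows; the file name is hypothetical:

import java.io.*;
import java.util.*;

public class VectorStore {
    // Loads term vectors from a word2vec text-format file (e.g. "vectors.txt"):
    // every line after the header is "term f1 f2 ... fn".
    public static Map<String, float[]> load(String path) throws IOException {
        Map<String, float[]> vectors = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            reader.readLine();  // skip the "vocabularySize numberOfFeatures" header line
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                float[] vector = new float[parts.length - 1];
                for (int i = 1; i < parts.length; i++) {
                    vector[i - 1] = Float.parseFloat(parts[i]);
                }
                vectors.put(parts[0], vector);
            }
        }
        return vectors;
    }
}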

3.3 Expanding the Natural-Language Queries

To expand queries, we first need to obtain the vector representation of a natural-language query. Then, based on the similarities between the query and API class-names, we find relevant API class-names from the corpus. Details are described as follows.

1) Obtaining the vector representation of a natural-language query. For each natural-language query, we perform camel case splitting, stop word removal and stemming. We use a vector Vq = (t1, t2, ..., ti, ..., tm) to represent a normalized natural-language query, where m is the total number of terms in the normalized query, and ti represents the ith term. For example, the natural-language query “How can I play sound using Clip in Java?” can be represented by the vector Vq = (“plai”, “sound”, “us”, “clip”).

The terms contained in the natural-language queries are likely to appear in the training data, which is created using a large-scale code snippet corpus. Then, for each term ti in the normalized query, we extract its term vector representation (i.e., a float vector) from the term vectors that are generated on the training data.

5. word2vec: http://code.google.com/p/word2vec/


We denote the term vector as ti = (fi1, fi2, ..., fij, ..., fin), where n is the total number of semantic features, and fij is the jth semantic feature.

We obtain the vector representation Tq of the normalized natural-language query by summing up the term vectors of each term of the query. That is,

Tq = ∑_{i=1}^{m} ti = ∑_{i=1}^{m} (fi1, fi2, ..., fij, ..., fin),

where m is the total number of terms in the query and n is the number of semantic features. For example, the query “How can I play sound using Clip in Java?” is normalized to (“plai”, “sound”, “us”, “clip”), which has four terms. Then we obtain the vector representation of this query by summing up the term vectors of each of the four terms “plai”, “sound”, “us” and “clip”.
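A minimal sketch of this summation, assuming the term vectors have been loaded as in the earlier example; skipping query terms that never appear in the training data is our own simplification:

import java.util.*;

public class QueryVector {
    // Tq = sum of the term vectors of all normalized query terms,
    // e.g. build(Arrays.asList("plai", "sound", "us", "clip"), termVectors, n).
    public static float[] build(List<String> normalizedQueryTerms,
                                Map<String, float[]> termVectors, int numberOfFeatures) {
        float[] queryVector = new float[numberOfFeatures];
        for (String term : normalizedQueryTerms) {
            float[] termVector = termVectors.get(term);
            if (termVector == null) {
                continue;  // term not seen in the training corpus (our simplification)
            }
            for (int j = 0; j < numberOfFeatures; j++) {
                queryVector[j] += termVector[j];
            }
        }
        return queryVector;
    }
}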

2) Inferring relevant API class-names for the query. Similarly, we obtain the term vector representation of each API class-name from the term vectors that are generated from the training data. We denote the vector representation of an API class-name as Tc = (f1, f2, ..., fi, ..., fn), where n is the number of semantic features.

For a natural-language query q, we compute its semantic distance to each API class-name c. The semantic distance is measured by the cosine similarity [29] between their vector representations, and can be computed using Eq. (1):

D_{q,c} = (Tq · Tc) / (|Tq| |Tc|)    (1)

where Tq is the vector representation of the natural-language query q, and Tc is the vector representation of the API class-name c.

Then we rank the API class-names based on their semantic distances to the query.

The more popular an API class is, the more widely it is used to accomplish programming tasks. The popularity of the API class-names is therefore another important feature to consider [19]. The popularity of an API class-name is measured by the number of Java files in the corpus that use the corresponding API class. Hence, we re-rank the top 10 API class-names based on their popularities.

Finally, we select the top k API class-names from the resulting ten API class-names to expand the query. After attempting different numbers (i.e., k = 1, 3, 5, 7, 10) on a development set of five queries, we decide to set k = 5. The development set is only used for choosing k, and is not included in the evaluation step. In addition, the completeness and redundancy of the selected API class-names are balanced when setting k = 5. Specifically, using only the top ranked API class-name (i.e., k = 1) might be insufficient to improve the search efficiency, while recommending many API class-names (e.g., k = 10) would increase the chance of including redundant API class-names that are not the most relevant to the query. For the query “How can I play sound using Clip in Java?” (see Fig. 1), the top 5 selected API class-names are: “Player”, “AudioInputStream”, “LineUnavailableException”, “Game”, and “AudioClip”.
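The selection step described above could be sketched as follows: class-names are ranked by their cosine similarity to the query vector (Eq. (1)), the top 10 are re-ranked by popularity, and the top k (k = 5) are returned. The popularity map (class-name to the number of Java files using it) is assumed to have been pre-computed from the corpus:

import java.util.*;

public class ClassNameRecommender {
    // Cosine similarity between two vectors (Eq. (1)).
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Ranks API class-names by semantic distance to the query, re-ranks the
    // top 10 by popularity, and returns the top k class-names.
    public static List<String> recommend(float[] queryVector,
                                         Map<String, float[]> classNameVectors,
                                         Map<String, Integer> popularity, int k) {
        List<String> ranked = new ArrayList<>(classNameVectors.keySet());
        ranked.sort(Comparator.comparingDouble(
                (String c) -> cosine(queryVector, classNameVectors.get(c))).reversed());

        List<String> reRanked = new ArrayList<>(ranked.subList(0, Math.min(10, ranked.size())));
        reRanked.sort(Comparator.comparingInt(
                (String c) -> popularity.getOrDefault(c, 0)).reversed());

        return reRanked.subList(0, Math.min(k, reRanked.size()));
    }
}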

3.4 Executing the Expanded Queries

We execute the expanded queries using a two-layer code search engine. The two layers are applied in the following order: the vector space model (VSM) [22] and the weighted-sum ranking schema [12].

1) The vector space model is used to retrieve an initial set of relevant code examples for expanded queries. However, VSM is a search engine for general text retrieval and is therefore not optimized for code search. Solely using the similarity between the code example and the query is insufficient to retrieve relevant code examples [24]. Therefore, based on the similarity scores of the retrieved code examples reported by VSM, we further apply the weighted-sum ranking schema.

2) The weighted-sum ranking schema computes the ranking score of a candidate code example using a weighted sum of various features of the code example [12]. In our approach, we consider the following five code example features:

(F1) fv is the similarity score produced by an off-the-shelf VSM technique, Lucene, to measure the similarity between two texts (i.e., the query and the code example in our case). Therefore, fv measures the text similarity between the query and the whole body of the code example.

(F2) fs measures the cosine similarity between the query and the signature of the code example [19]. The signature of the code example only includes the method name and the type of each parameter. Thus fs captures the relevance between the query and the method declaration (i.e., method name and parameter types).

(F3) fn denotes the sum of the total number of occurrences that each line of code in the code example appears in the corpus [5]. fn describes the popularity of the code example.

(F4) fp is the number of parameters of the code example, and reflects the amount of information required to execute the code example.

(F5) fa denotes whether the recommended API class-names for the query expansion appear in the parameters of the code example. If the answer is yes, fa is set to -1; otherwise, it is 0.

We compute the ranking score of a code example as the weighted sum of its five features, using Eq. (2):

S = w·fv + w·fs + 2w·fn + w·fp + w·fa    (2)

where fv, fs, fn, fp, and fa are the five features of a code example, and w is the weight for the features.

6. Lucene: https://lucene.apache.org/


TABLE 3
Summary of our corpus.

Feature                     Number
Number of projects          24,666
Number of Java files        2,882,451
Number of code snippets     23,677,216

In our implementation, we assign equal weights to each feature (except for fn). We set the weight of feature fn twice as large as the other weights, since ranking code examples based on feature fn performs better than ranking based on the other features [19].
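Under this equal-weight setting, the score of Eq. (2) reduces to the following computation; the five feature values are assumed to be pre-computed for each candidate code example:

public class WeightedSumRanker {
    // Ranking score of Eq. (2): S = w*fv + w*fs + 2w*fn + w*fp + w*fa,
    // where fn is weighted twice as heavily as the other features and fa is -1 or 0.
    public static double score(double fv, double fs, double fn,
                               double fp, double fa, double w) {
        return w * fv + w * fs + 2 * w * fn + w * fp + w * fa;
    }
}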

4 CASE STUDY SETUP

In this section, we present the setup of our case study to evaluate the performance of our proposed approach. The case study setup includes our corpus of code snippets, the golden set, and baseline approaches for comparison purposes.

4.1 Corpus

A large-scale corpus of code snippets is needed for evaluating the performance of retrieving code examples for a given code search query. To build the corpus, we download a large inter-project repository, IJaDataset, that was created by Keivanloo et al. [19] and was later used by Svajlenko et al. [34] to build a benchmark for clone detection. The dataset contains 24,666 open-source projects collected from SourceForge and Google Code. In total, there are 2,882,451 unique Java files.

To extract code snippets from Java files, we use the same approach as the earlier work [26] on code search and recommendation of Java source code. Specifically, we use a Java syntax parser available in Eclipse JDT to extract one code snippet from each single method in the Java files. Finally, 23,677,216 code snippets are extracted and added into the code snippet corpus. Table 3 summarizes the key descriptive statistics of the corpus.

4.2 Golden set

A golden set is a set of queries associated with a list of code examples that are correctly tagged to be relevant to each query. A golden set is essential to evaluate how our approach performs in recommending code examples for natural-language code search queries.

1) First golden set. The first golden set is obtained from a benchmark that was created for clone detection purposes by Svajlenko et al. [34]. In the benchmark, there are thousands of code snippets from ten distinct functionalities (hereby, target functionalities) that are frequently used in open-source Java

7. IJaDataset: https://github.com/clonebench/BigCloneBench
8. The parser from Eclipse JDT: http://www.eclipse.org/jdt/

public static void copyFile(File srcFile, File destFile)
        throws IOException {
    if (!destFile.exists()) {
        destFile.createNewFile();
    }

    FileInputStream fis = new FileInputStream(srcFile);
    FileOutputStream fos = new FileOutputStream(destFile);

    FileChannel source = fis.getChannel();
    FileChannel destination = fos.getChannel();

    destination.transferFrom(source, 0, source.size());

    source.close();
    destination.close();

    fis.close();
    fos.close();
}

Fig. 3. One code example in the first golden set.

projects. The number of target functionalities has been expanded to 40 in the new version of the benchmark dataset. In the benchmark, each code snippet is tagged as to whether it correctly implements each target functionality. The tagging has been done by three different annotators in their study [34].

For each target functionality, we locate its corresponding natural-language code search query from SO. The benchmark contains the specification of how to implement each target functionality. To find the corresponding natural-language query for each target functionality from SO, we check whether the accepted answer of the SO question exactly matches its specification.

For each target functionality, we consider a code snippet truly relevant to the corresponding query if it is tagged as correctly implementing the target functionality. For example, one code example in the golden set, shown in Fig. 3, correctly implements the target functionality “Copy File”.

Among the 40 target functionalities, we filter out the ones that do not use APIs for implementing the functionality. This is because our approach expands the initial queries with API class-names, and it would make no difference to the search performance if APIs are not used. The authors manually reviewed the 40 target functionalities in the benchmark [34] and finally obtained 24 target functionalities. The corresponding queries and code examples of the 24 target functionalities are then selected to evaluate the performance of our approach.

2) Second golden set. To evaluate search engines, the minimum number of queries in a golden set is 50 [22]. We therefore created a second golden set using 50 code search questions on Stack Overflow. In the second golden set, the correct code examples for each query are identified using the pooling method [22]. The pooling method is the most standard evaluation method in information retrieval and was previously

9. Benchmark dataset: https://github.com/XBWer/BigCloneBench


chosen to evaluate the schema for code example retrieval [2]. In the pooling method, evaluators assess the relevance of the top k results returned by a number of different approaches instead of the entire list of returned results. The pooling method is usually applied if the corpus is very large. Our corpus contains 23,677,216 code snippets in total; therefore it is impossible to manually examine all retrieved code snippets for each query.

For each query, we put the top k code snippets that are retrieved by the six studied approaches (see Figs. 4 and 6) into a pool, and remove any duplicates. Annotators are provided with the entire set of code snippets from the pool in a randomized order. Then the annotators independently label the relevance of each code snippet. We conjecture that, in practice, developers would have changed their queries if they could not find relevant code snippets on the first five pages. Most search engines (e.g., Google) present only ten results on each page [2]. So we choose to use the top 50 results (i.e., k = 50, five pages with 10 results per page) to build the pool in our evaluation. Finally, we get a pool with at most 300 code snippets for each query.

The creation of the second golden set has been done by three annotators, who are graduate students with more than five years’ experience in Java development. We apply Cohen’s kappa coefficient to measure inter-annotator agreement. As we have three annotators, we compute the pair-wise Cohen’s kappa coefficients and calculate the average value. In our study, the average Cohen’s kappa coefficient is 0.80, indicating a good agreement [22]. When there is a disagreement, the majority rule is applied.

We combine the first and the second golden sets for evaluation. The combination contains 74 natural-language code search queries in total. All 74 queries can be found in the replication package.

4.3 Comparison Baselines

In the text retrieval literature, a variety of approaches for query expansion have been proposed [4]. When applied to software engineering data, three approaches (i.e., Dice [8], Rocchio [28], and RSV [45]) are demonstrated to perform the best [14]. Therefore, we choose these three approaches as the baselines against which we compare our approach.

In the three approaches, the candidate terms are selected from the top k code examples that are ranked by a ranking strategy. Among all candidate terms, the top n terms are recommended to expand the initial query. Similar to the study by Haiduc et al. [14], we set k = 5 and n = 10. The details of the three approaches are described as follows.

1) Dice. This approach ranks the terms in the top k code examples based on their Dice similarity to the terms of the initial query. The Dice similarity [8] can be computed as:

Dice = 2·df_{q∩s} / (df_q + df_s)    (3)

where q denotes a term contained in a query; s is a term extracted from the top k code examples; df_q, df_s and df_{q∩s} denote the number of code examples in the corpus that contain q, s, or both q and s, respectively [8].

2) Rocchio. This approach orders the terms in the top k code examples using Rocchio’s method [28]. The score of each term is computed as the sum of the tf × idf scores of the term in the top k code examples, as shown in Eq. (4):

Rocchio = ∑_{c∈R} tf(t, c) × idf(t, c)    (4)

where R is the set of the top k code examples in the result list; c is a code example in R; and t is a term in the code example c [28].

The term frequency tf(t, c) is the number of times term t appears in code example c. The term frequency tf(t, c) indicates the importance of term t over other terms in code example c [22]. The inverse document frequency idf(t, c) indicates the specificity of term t to code example c that contains it. The inverse document frequency idf(t, c) can be considered as the inverse of the number of code examples in the corpus that contain term t [22].

3) RSV. This approach uses the Robertson Selection Value (RSV) [45] to compute the ranking score for the terms in the top k code examples, using Eq. (5):

RSV = ∑_{c∈R} tf(t, c) × idf(t, c) × [p(t|R) − p(t|C)]
p(t|R) = freq(t ∈ R) / |R|_t
p(t|C) = freq(t ∈ C) / |C|_t    (5)

where C denotes the corpus of code snippets; R is the set of top k code examples; c is a code example in R; t is a term in code example c; freq(t ∈ R) and freq(t ∈ C) are the number of times that term t appears in R and C, respectively; and |R|_t and |C|_t are the number of terms in R and C, respectively [45].
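For concreteness, the three baseline term-scoring functions could be sketched as follows; the document-frequency counts, the tf × idf values over the top k code examples, and the probability estimates are assumed to be supplied by an index over the corpus and the result list:

import java.util.List;

public class BaselineExpansion {
    // Dice similarity (Eq. (3)): dfQ, dfS and dfBoth are the numbers of code
    // examples containing the query term q, the candidate term s, and both.
    public static double dice(int dfQ, int dfS, int dfBoth) {
        return (2.0 * dfBoth) / (dfQ + dfS);
    }

    // Rocchio score (Eq. (4)): sum of tf(t, c) x idf(t, c) over the top k code examples R.
    public static double rocchio(List<Double> tfIdfPerExample) {
        double score = 0;
        for (double tfIdf : tfIdfPerExample) {
            score += tfIdf;
        }
        return score;
    }

    // RSV score (Eq. (5)): since p(t|R) and p(t|C) do not depend on the code example c,
    // the factor [p(t|R) - p(t|C)] can be pulled out of the sum over R.
    public static double rsv(List<Double> tfIdfPerExample, double pGivenR, double pGivenC) {
        return rocchio(tfIdfPerExample) * (pGivenR - pGivenC);
    }
}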

4.4 Performance Measure

In our experiment, we consider three measures of the effectiveness of each code search approach:

1) Coverage, which measures how good an approach is across a set of queries. Coverage of a code search approach is defined as the proportion of queries [2] for which there is at least one relevant code snippet among the top 10 results retrieved by the approach.

2) The rank of the first correct answer, which measures how good an approach is for an individual query. The rank of the first correct answer is the position of the first correct answer (i.e., the first truly relevant code example) in the result list.

3) Mean reciprocal rank (MRR) [39], which is a statistical measure of the performance of a ranking approach,


Fig. 4. The approach for studying the efficiency of our approach in code search (RQ3). (The figure shows two steps: in step 1, the initial query and the query expanded by our approach are each executed in VSM (Lucene) to retrieve code examples; in step 2, the weighted-sum ranking schema (WSR) re-ranks the code examples retrieved for the expanded query.)

and is widely used in the information retrieval (IR) community. It is defined as the average of the reciprocal ranks (RR) of the first correct answer:

MRR = (1 / |Q|) ∑_{i=1}^{|Q|} RR_i,

where |Q| is the total number of queries used for evaluation and RR_i is the reciprocal rank of the first correct answer for the ith query (i.e., 1/rank_i).

5 CASE STUDY RESULTS

In this section, we discuss the results of our case study. For each research question, we describe the motivation, analysis approach, and findings.

RQ3: Does the expanded query work better than the natural-language query in code search?

Motivation. In RQ2, we confirm the importance of using identifiers to retrieve relevant code examples. We then propose an approach to expand natural-language queries with API class-names that are semantically related to the initial query. In this question, we aim to evaluate whether our approach can effectively expand natural-language queries towards improving the rank of the first relevant code example.

Approach. As our proposed approach contains two parts (i.e., query expansion and the weighted-sum ranking schema), we design our experiment in two steps. The process is illustrated in Fig. 4.

1) Study the effect of expanding queries. We execute both the initial query and the expanded query in Lucene (i.e., a tool that implements VSM) and retrieve code examples from the corpus of code snippets. We compare the search results of the initial and expanded queries to evaluate the performance of query expansion. We calculate coverage and the rank of the first relevant code example, and test the following null hypothesis:

H1a30: there is no difference between the ranks of the first relevant code example returned by executing the initial queries and the ones returned by executing the expanded queries.

To test the null hypothesis H1a30, we apply the Wilcoxon signed-rank test with the 95% confidence level (i.e., p-value < 0.05). The Wilcoxon signed-rank test

Fig. 5. The beanplots of the rank of the first relevant code example using the initial queries, and the expanded queries without and with the weighted-sum ranking schema. The overall line depicts the median rank of the first relevant code example. (The y-axis shows the rank of the first correct code example on a logarithmic scale from 1 to 10000; the three beanplots correspond to the initial query (VSM), the expanded query (VSM), and the expanded query (VSM + WSR).)

has no assumption about the distribution of the assessed variable [30], and it has been commonly applied in the information retrieval (IR) literature to compare IR measures [38, 46]. If the p-value is less than 0.05, we reject the null hypothesis H1a30 and conclude that the search performance of using the initial and expanded queries is significantly different. We further compare the ranks of the first relevant code example to find which approach performs better.

The search performance is improved if the MRR value increases. To assess if the improvement in terms of the MRR value is statistically significant, we conduct the Wilcoxon signed-rank test with the 95% confidence level (i.e., p-value < 0.05) to test the null hypothesis:

H1b30: there is no difference in the reciprocal ranks (RR) between using the initial queries and using the expanded queries.

2) Study the effect of using the weighted-sum ranking schema. We compare the search results of the expanded query with and without the weighted-sum ranking schema (i.e., WSR). This helps us understand the impact of the weighted-sum ranking schema on the search performance. Similarly, we compute the coverage, the rank of the first relevant code example, and the MRR values. We test the following null hypotheses using the Wilcoxon signed-rank test with the 95% confidence level (i.e., p-value < 0.05):

H2a30: there is no difference between the ranks of the first relevant code example returned by executing the expanded query without the weighted-sum ranking schema and the ones returned by executing the expanded query with the weighted-sum ranking schema.

H2b30: there is no difference in the reciprocal ranks (RR) with and without using the weighted-sum ranking schema.


Findings. The search performance for using the expanded query is better than for using the initial query. By expanding the query, the coverage of code searches is increased from 36% to 54% for the 74 queries in the combination of our two golden sets. Fig. 5 depicts the rank of the first relevant code example in our three settings. The median rank of the first relevant code example is improved from 22 to 7 by expanding queries. As pointed out by Gottipati et al. [11], the higher a relevant code example is ranked, the more valuable it is for a developer, because it is more likely to be examined by a developer. Moreover, the p-value of the Wilcoxon signed-rank test is 0.03. Therefore, we reject the null hypothesis H1a30. In addition, for 58% (i.e., 43 out of 74) of queries, expanding the initial query yields better ranks of the first relevant code example, and ties occur in seven queries.

The MRR values obtained using the initial queries and the expanded queries on the entire set of 74 queries are 0.20 and 0.29, respectively. The p-value of the Wilcoxon signed-rank test is 0.03. Hence, we reject the null hypothesis H1b30 and conclude that our query expansion can significantly improve the search performance in terms of the coverage, the rank of the first relevant code example, and the MRR value.

The weighted-sum ranking schema can further improve the search performance. With the weighted-sum ranking schema, the coverage of code searches is further increased from 54% to 76%, and the median rank of the first relevant code example is improved from 7 to 2. Our approach (i.e., query expansion and the weighted-sum ranking schema) can retrieve relevant code examples in the top 2 code snippets for 54% (i.e., 40 out of 74) of queries. The p-value of the Wilcoxon signed-rank test is 3.56e-03. We reject the null hypothesis H2a30.

The MRR value obtained on the entire set of 74 queries is increased from 0.29 to 0.53 when using the weighted-sum ranking schema. The p-value of the Wilcoxon signed-rank test is 1.18e-04. Hence, we reject the null hypothesis H2b30. We conclude that the weighted-sum ranking schema can further significantly improve the search performance in terms of the coverage, the rank of the first relevant code example, and the MRR value.

In summary, with query expansion, relevant code examples for 50% of the 74 studied queries can be retrieved in the top 7 results, which usually appear on the first result page and are likely to be examined by developers. The weighted-sum ranking schema further saves developers' effort by potentially providing relevant code examples in the top 2 results and by increasing the coverage of code searches.

Our query expansion can significantly improve the search performance, and the weighted-sum ranking schema can bring a further improvement.

[Figure 6: workflow; the initial query is expanded by our approach, Dice, Rocchio, and RSV, and each expanded query is fed to the code search engine (VSM + WSR) to retrieve code examples.]

Fig. 6. The approach for comparing the four query expansion approaches (RQ4).

RQ4: Does our approach outperform the off-the-shelf query expansion approaches in answering natural-language code search queries?

Motivation. As aforementioned, three off-the-shelf approaches (i.e., Dice [8], Rocchio [28], and RSV [45]) perform the best for software engineering data among a variety of query expansion approaches. To expand a query, the three approaches generate related terms from the initial query. In this research question, we aim to study whether our expansion approach is better than the three popular approaches in the context of code search.

Approach. To address this question, we implement the three query expansion approaches as described in Section 4.3. As presented in Fig. 6, we first expand the initial query using each of the four approaches. For each of the four expanded queries, we apply it in Lucene (i.e., VSM) to retrieve the top 10 code examples. Then we apply the weighted-sum ranking schema (i.e., WSR) to obtain the final rank of the ten code examples for each of the four approaches.
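The sketch below approximates this retrieval-and-re-ranking pipeline in Python, with a TF-IDF vector space model (via scikit-learn) standing in for Lucene; the toy snippets, the popularity signal, and the equal 0.5/0.5 weights are illustrative assumptions, not the actual definition of the weighted-sum ranking schema.

# Sketch of VSM retrieval followed by a weighted-sum re-ranking of the top results.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

snippets = [
    "Connection conn = DriverManager.getConnection(url, user, password);",
    "BufferedReader reader = new BufferedReader(new FileReader(path));",
    "HttpURLConnection con = (HttpURLConnection) url.openConnection();",
]  # toy stand-ins for the indexed code snippets

expanded_query = "connect to a database Connection DriverManager"

vectorizer = TfidfVectorizer()
snippet_matrix = vectorizer.fit_transform(snippets)
query_vector = vectorizer.transform([expanded_query])

# Step 1: VSM retrieval -- keep the top-10 snippets by cosine similarity.
vsm_scores = cosine_similarity(query_vector, snippet_matrix)[0]
top10 = sorted(range(len(snippets)), key=lambda i: vsm_scores[i], reverse=True)[:10]

# Step 2: re-rank the candidates with a weighted sum of the VSM score and a
# second signal; `popularity` and the 0.5/0.5 weights are placeholders.
popularity = {0: 0.9, 1: 0.4, 2: 0.6}

def weighted_sum(i, w_vsm=0.5, w_pop=0.5):
    return w_vsm * vsm_scores[i] + w_pop * popularity.get(i, 0.0)

for rank, i in enumerate(sorted(top10, key=weighted_sum, reverse=True), start=1):
    print(rank, snippets[i])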

We compare the effectiveness of our query expansion approach and the three off-the-shelf approaches. For each off-the-shelf approach, we test the following null hypothesis, using the Wilcoxon signed-rank test with the 95% confidence level (i.e., p-value < 0.05):

Ha40: there is no difference between the ranks of the first relevant code example obtained for queries expanded by our approach and the ones obtained for queries expanded by the off-the-shelf approach.

If the p-value is less than 0.05, we reject the null hypothesis Ha40 and conclude that the effectiveness of our query expansion approach and the off-the-shelf approach is significantly different in terms of the rank of the first relevant code example. We further compare the ranks of the first relevant code example to find which query expansion approach performs better.

To assess if the improvement in the MRR value is statistically significant, we conduct the Wilcoxon signed-rank test with the 95% confidence level (i.e., p-value < 0.05) to test the null hypothesis:


[Figure 7: beanplots; y-axis: rank of the first correct code example (1 to 10,000); pairs: Dice vs. Ours, Rocchio vs. Ours, RSV vs. Ours.]

Fig. 7. The beanplots of the rank of the first relevant code example using the three off-the-shelf approaches and our approach. The overall line depicts the median rank of the first relevant code example.

Hb40: there is no difference in the reciprocal ranks (RR) between the queries expanded by our approach and the queries expanded by the off-the-shelf approach.

Findings. Our approach performs better than the three off-the-shelf approaches at expanding natural-language queries for code search. The coverage of successful code searches is 76% using our approach, which is greater than that of the three off-the-shelf approaches, which achieve 59% (Dice), 54% (Rocchio), and 51% (RSV). Fig. 7 presents the comparison results between the three off-the-shelf approaches and our approach using the combination of the two golden sets. The grey side represents the performance for the queries expanded by each of the three off-the-shelf approaches, while the black side denotes the performance for the queries expanded by our approach.

The median rank of the first relevant code example when using the queries expanded by our approach is 2. When using the three off-the-shelf approaches, the median rank of the first relevant code example is 4.5 (Dice), 9 (Rocchio), and 8.5 (RSV), respectively. The p-values of the Wilcoxon signed-rank test are 8.69e-03 (Dice), 4.62e-03 (Rocchio), and 1.51e-04 (RSV), respectively. Therefore, we reject the null hypothesis Ha40 for each off-the-shelf approach. Compared to Dice, Rocchio, and RSV, our approach results in better ranks of the first relevant code example for 47% (35/74), 50% (37/74), and 54% (40/74) of queries, respectively. Our approach exhibits ties with Dice, Rocchio, and RSV for 30% (22/74), 24% (18/74), and 26% (19/74) of queries, respectively.

The MRR values obtained on the entire set of 74 queries for Dice, Rocchio, RSV, and our approach are 0.42, 0.36, 0.36, and 0.53, respectively. The p-values of the Wilcoxon signed-rank test are 0.03 (comparing to Dice), 4.18e-03 (comparing to Rocchio), and 2.10e-03 (comparing to RSV), respectively. Hence, we reject the null hypothesis Hb40 in all cases.

In summary, we conclude that our approach outperforms the studied off-the-shelf query expansion approaches in expanding natural-language code search queries in terms of the coverage, the rank of the first relevant code example, and the MRR value.

On query expansion, our approach outperforms the three studied off-the-shelf approaches in terms of coverage and the rank of the first relevant code example.

6 THREATS TO VALIDITY

In this section, we discuss the threats to validity that might affect the results of the study by following the common guidelines provided by Yin [44].

Threats to construct validity concern whether the setup and measurement in the study reflect real-world situations. The threats to construct validity of our study mainly come from the categorization of Stack Overflow questions, the selection of code search queries, and the creation of the golden sets. For the classification of the SO questions, we refer to the question types defined in the study by Stolee et al. [33]. Although we do not conduct cross validation in the classification process, the result is in line with their observation. We create two golden sets for the evaluation. One golden set is created based on the benchmark data used for a clone detection study [34]. The other one is created using the pooling method, which was previously used by Bajracharya et al. [2] to evaluate code example retrieval schemes. The creation process of the two golden sets has been done by different judges, and the majority rule is applied when there is disagreement on the correctness of code examples. The usage of only six approaches is another threat to construct validity, as the pooling method used in the creation of the second golden set requires many approaches. However, we mitigate this threat by judging a large pool of documents.

Threats to internal validity concern the co-factors that may affect the experimental results. For our study, the main threats to internal validity are the number of recommended API class-names and the clicking analysis on the Koders data. We have tried different numbers of API class-names recommended to expand the natural-language code search queries, using a small set of queries that are only used for parameter setting. We find that expanding the queries with the top five API class-names performs better than the other settings; thus, we set the number of recommended API class-names to five in our experiments. In RQ2, we decide whether the results returned by Koders answer users' queries using clicking analysis. It could be argued that users might just preview the search results without actually clicking into the results. However, as the result page of Koders shows only the top few lines


of each returned result, we conjecture that users are likely to click into a result to see more details if they find it relevant.

Threats to external validity concern the generalization of the experiment results. The main threats to external validity of our study are the representativeness of our corpus and code search queries. Our corpus contains 23,677,216 code snippets extracted from 24,666 open-source projects. We select 74 descriptive questions from SO as natural-language code search queries to study the search effectiveness of our approach. The size of our query set is larger than those of similar studies on code search and recommendation, e.g., the 22 queries used in [35] and the 20 queries evaluated in [2]. In addition, the size of our query set meets the minimum number of queries recommended for search engine evaluation [22].

Threats to conclusion validity concern the relationship between treatment and outcome. We use a non-parametric statistical test (the Wilcoxon signed-rank test) to show the statistical significance of the obtained experiment results. The Wilcoxon signed-rank test makes no assumption about the distribution of the data [30]. In addition, we consider that a code example contains only one method. Therefore, our conclusion on the effectiveness of our approach is limited to retrieving code examples that have only one method.

Threats to reliability validity concern the possibility of replicating the study. We provide the necessary details needed to replicate our work. The replication package is publicly available at: http://www.feng-zhang.com/replications/TSEqueryExpansion.html

7 RELATED WORK

In this section, we review the studies related to code search engines and query expansion technology.

7.1 Code Search

To solve programming tasks, developers can retrieve code snippets from code search engines by providing a query. Many approaches for code search and recommendation have been proposed (e.g., [6, 16, 26, 33, 35, 40]). For instance, GitHub and Koders are two internet-scale code search engines that behave as general-purpose search engines (i.e., accepting keywords as the input query).

Besides the keyword-based query, there are various forms of queries that are specialized for code search. Based on a query in the form "Source object type → Destination object type", ParseWeb [35] recommends code examples containing the given object type. Strathcona [16] locates relevant code examples through structural matching. The structural context of the code under development, including the name of the method being written and its containing class, is automatically extracted as the query. Wang et al. [42] propose to utilize dependency conditions among identifiers (e.g., class and method names) to retrieve relevant code snippets within a project. Wang et al. [40] further combine topic analysis with dependency conditions to improve the accuracy of code search. Stolee et al. [33] design a semantic-based code search system that aims to generate specific functions to meet developers' specifications. Developers need to specify their query as precisely as possible, using a combination of keywords (e.g., class or method signatures) and dynamic specifications (e.g., test cases and security constraints). Mishne et al. [26] propose to use a sequence of API calls as queries to retrieve partial code snippets.

Different from the aforementioned approaches, we use the general form of query (i.e., a query described in natural language), but propose to expand the query with semantically related API class-names.

7.2 Query Expansion

Query expansion has long been an effective way to improve the performance of text retrieval (TR) engines [28]. Query expansion approaches fall into two major categories: interactive approaches and automatic approaches [14].

The interactive approach is based on relevance feedback and uses the "best" terms from the results of the initial query to augment the query for a further retrieval. The relevance feedback can require the involvement of users (i.e., explicit relevance feedback, ERF) or can be done automatically (i.e., pseudo/blind relevance feedback, PRF). ERF requires users to explicitly provide their judgements on the relevance of the returned documents to the query (e.g., [9, 15, 41]). Although ERF can provide the most accurate feedback, its use in source code retrieval is highly limited, as it demands considerable effort from users to mark the relevant and irrelevant documents. On the other hand, PRF does not require user involvement. PRF uses the top set of the retrieved documents as an approximation to the set of relevant documents (e.g., [14]). Therefore, it can only work when the initial query is reasonably strong (i.e., at least some of the relevant documents can be retrieved).
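As an illustration of PRF, the sketch below treats the top retrieved documents as pseudo-relevant and appends their most frequent non-query terms to the initial query; the toy documents, the tokenization, and the number of feedback terms are simplifying assumptions.

# Minimal pseudo-relevance feedback (PRF) sketch on toy data.
import re
from collections import Counter

initial_query = "read a text file line by line"
pseudo_relevant_docs = [
    "BufferedReader reader = new BufferedReader(new FileReader(path));",
    "while ((line = reader.readLine()) != null) lines.add(line);",
    "List<String> lines = Files.readAllLines(Paths.get(path));",
]  # assumed to be the top results of the initial retrieval

query_terms = set(initial_query.lower().split())

# Count alphabetic tokens of the pseudo-relevant documents, excluding query terms.
term_counts = Counter(
    t for doc in pseudo_relevant_docs
    for t in re.findall(r"[A-Za-z]+", doc.lower())
    if t not in query_terms
)

# Augment the query with the most frequent feedback terms.
expansion_terms = [term for term, _ in term_counts.most_common(3)]
print(initial_query + " " + " ".join(expansion_terms))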

The automatic approach is based on the automated identification of words that are semantically related to the initial query. Sridhara et al. [32] report that words that are semantically related in English might not be semantically related in software products. Marcus et al. [23] use Latent Semantic Indexing (LSI) to determine the words that are more likely to co-occur in source code, and use the co-occurrence information to recommend words that are semantically related to the queries. Ge et al. [10] implement pre-recommendation and post-recommendation for initial queries, mainly by identifying frequently co-occurring terms and synonym terms, respectively.


Yang and Tan [43] infer semantically related words through a pairwise comparison between the code segments and the comments in the source code. Tian et al. [36] propose to apply WordNet to find semantically related words. Similarly, Lu et al. [21] expand queries based on word similarity learned from WordNet. Moreover, Sisman and Kak [31] use positional proximity to identify the terms that are "close" to the initial query terms in the source code.

In this paper, we propose to use only API class-names for query expansion. The semantically related API class-names of each query are automatically identified by a neural network language model (specifically, CBOW [25]).

8 CONCLUSION

A common practice of developers to solve programming problems is to ask questions on online forums (e.g., Stack Overflow). We observe that questions aiming for code examples are prevalent on Stack Overflow. We consider such questions as natural-language queries to be fed into code search engines. It can relieve developers' programming burden to a great extent if code search engines can return relevant code examples. However, code search engines usually perform well for queries with full identifiers, but not for natural-language queries that may not contain identifiers. Therefore, it is of significant interest to expand natural-language queries using related identifiers.

To this end, we propose an automated approach that identifies API class-names semantically related to a given natural-language query. First, we apply the continuous bag-of-words model (CBOW) [25] to learn vector representations of API class-names. Second, we compute the semantic distance between the initial query and API class-names using the learned term vectors. Then, we select the most semantically related API class-names to expand the initial query.
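A minimal sketch of these three steps is shown below, assuming gensim (4.x) as the CBOW implementation; the toy corpus, the small API class-name vocabulary, and the use of an averaged query vector with cosine similarity are simplifying assumptions rather than the exact distance computation described in our approach.

# Sketch of CBOW-based expansion of a natural-language query with API class-names.
import numpy as np
from gensim.models import Word2Vec

corpus = [
    ["connect", "to", "a", "database", "Connection", "DriverManager"],
    ["open", "a", "jdbc", "Connection", "with", "DriverManager"],
    ["read", "a", "file", "with", "BufferedReader", "FileReader"],
]  # toy training sentences; the study trains on a much larger corpus
api_class_names = ["Connection", "DriverManager", "BufferedReader", "FileReader"]

# Step 1: learn term vectors with CBOW (sg=0 selects CBOW rather than skip-gram).
model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, sg=0, epochs=50)

def query_vector(query_terms):
    # Average the vectors of the query terms that are in the vocabulary.
    vectors = [model.wv[t] for t in query_terms if t in model.wv]
    return np.mean(vectors, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Steps 2 and 3: rank API class-names by semantic similarity to the query and
# keep the top five to expand it.
query = ["connect", "to", "a", "database"]
q_vec = query_vector(query)
ranked = sorted(api_class_names, key=lambda c: cosine(q_vec, model.wv[c]), reverse=True)
print(" ".join(query + ranked[:5]))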

Our approach is evaluated using 74 natural-language queries. We measure the search effectiveness using the coverage and the rank of the first desired code example in the retrieved list of code examples. The results of the case study show that using our expanded queries is significantly better than using the initial queries. Using the weighted-sum ranking schema further improves the coverage of code searches. We also compare our query expansion approach with three popular query expansion approaches (i.e., Dice [8], Rocchio [28], and RSV [45]) that have been demonstrated to perform the best for software engineering data [14]. The results show that our approach performs better than the three query expansion approaches. In summary, expanding natural-language queries with API class-names is a promising direction.

In the future, we plan to integrate our approach with a state-of-the-art code search engine to effectively recommend code examples for natural-language queries. It would be fruitful for existing code search engines to adopt our approach to enhance their support of natural-language queries. Moreover, we plan to retrieve more complex code examples that span several methods.

REFERENCES

[1] S. K. Bajracharya and C. V. Lopes. Analyzing and mining a code search engine usage log. Empirical Software Engineering, 17(4-5):424–466, 2012.

[2] S. K. Bajracharya, J. Ossher, and C. V. Lopes. Leveraging usage similarity for effective retrieval of examples in code repositories. In 8th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 157–166, 2010.

[3] M. G. Bulmer. Principles of Statistics. Courier Corporation, 1967.

[4] C. Carpineto and G. Romano. A survey of automatic query expansion in information retrieval. ACM Computing Surveys, 44:1–56, 2012.

[5] S. Chatterjee, S. Juvekar, and K. Sen. Sniff: A search engine for Java using free-form queries. In 12th International Conference on Fundamental Approaches to Software Engineering, pages 385–400, 2009.

[6] R. Cox. Regular expression matching with a trigram index or how Google Code Search worked. Online resource: http://swtch.com/~rsc/regexp/regexp4.html, 2012.

[7] Creative Research Systems. Sample Size Calculator. http://www.surveysystem.com/sscalc.htm, 2012. [Online; accessed 18-February-2016].

[8] L. R. Dice. Measures of the amount of ecologic association between species. Ecology, 26:297–302, 1945.

[9] G. Gay, S. Haiduc, A. Marcus, and T. Menzies. On the use of relevance feedback in IR-based concept location. In International Conference on Software Maintenance, pages 997–1016, 2009.

[10] X. Ge, D. C. Shepherd, K. Damevski, and E. R. Murphy-Hill. How developers use multi-recommendation system in local code search. In International Conference on Visual Languages and Human-Centric Computing, pages 69–76, 2014.

[11] S. Gottipati, D. Lo, and J. Jiang. Finding relevant answers in software forums. In 26th International Conference on Automated Software Engineering, pages 323–332, 2011.

[12] M. Grechanik, C. Fu, Q. Xie, C. McMillan, D. Poshyvanyk, and C. Cumby. Exemplar: A source code search engine for finding highly relevant applications. IEEE Transactions on Software Engineering, 38:1069–1087, 2012.

[13] P. E. Greenwood and M. S. Nikulin. A Guide to Chi-Squared Testing. John Wiley & Sons, Inc., 1996.

[14] S. Haiduc, G. Bavota, A. Marcus, R. Oliveto, A. D. Lucia, and T. Menzies. Automatic query reformulations for textual retrieval in software engineering. In International Conference on Software Engineering, pages 842–851, 2013.

[15] J. H. Hayes, A. Dekhtyar, and S. K. Sundaram. Advancing candidate link generation for requirements tracing: The study of methods. IEEE Transactions on Software Engineering, 32:4–19, 2006.

[16] R. Holmes and G. C. Murphy. Using structural context to recommend source code examples. In 27th International Conference on Software Engineering, pages 117–125, 2005.

[17] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pages 133–142, 2002.

[18] T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Transactions on Information Systems, 25(2), April 2007.

[19] I. Keivanloo, J. Rilling, and Y. Zou. Spotting working code examples. In 36th International Conference on Software Engineering, pages 664–675, 2014.

[20] A. Kuhn, S. Ducasse, and T. Girba. Semantic clustering: Identifying topics in source code. Information and Software Technology, 49:230–243, 2007.

[21] M. Lu, X. Sun, S. Wang, D. Lo, and Y. Duan. Query expansion via WordNet for effective code search. In 22nd IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 545–549, March 2015.

[22] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.

[23] A. Marcus, A. Sergeyev, V. Rajlich, and J. I. Maletic. An information retrieval approach to concept location in source code. In Working Conference on Reverse Engineering, pages 214–223, 2004.

[24] C. McMillan, D. Poshyvanyk, M. Grechanik, Q. Xie, and C. Fu. Portfolio: Searching for relevant functions and their usages in millions of lines of code. ACM Transactions on Software Engineering and Methodology, 22, 2013.

[25] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In Workshop at ICLR, 2013.

[26] A. Mishne, S. Shoham, and E. Yahav. Typestate-based semantic code search over partial programs. In ACM International Conference on Object Oriented Programming Systems Languages and Applications, pages 997–1016, 2012.

[27] D. Newcomb. The next big OS war is in your dashboard. Wired, 2012.

[28] J. J. Rocchio. Relevance feedback in information retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing, pages 313–323, 1971.

[29] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, pages 613–620, 1975.

[30] D. J. Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures. Chapman and Hall/CRC, 2007.

[31] B. Sisman and A. C. Kak. Assisting code search with automatic query reformulation for bug localization. In 10th IEEE Working Conference on Mining Software Repositories, pages 309–318, 2013.

[32] G. Sridhara, E. Hill, L. Pollock, and K. V. Shanker. Identifying word relations in software: A comparative study of semantic similarity tools. In International Conference on Program Comprehension, pages 123–132, 2008.

[33] K. T. Stolee, S. Elbaum, and D. Dobos. Solving the search for source code. ACM Transactions on Software Engineering and Methodology, 23, 2014.

[34] J. Svajlenko, J. F. Islam, I. Keivanloo, C. K. Roy, and M. M. Mia. Towards a big data curated benchmark of inter-project code clones. In 30th International Conference on Software Maintenance and Evolution, 2014.

[35] S. Thummalapenta and T. Xie. ParseWeb: A programmer assistant for reusing open source code on the web. In 22nd International Conference on Automated Software Engineering, pages 204–213, 2007.

[36] Y. Tian, D. Lo, and J. Lawall. Automated construction of a software-specific word similarity database. In 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering and Reverse Engineering (CSMR-WCRE), pages 44–53, February 2014.

[37] TIOBE. TIOBE Index for March 2016. http://www.tiobe.com/tiobe_index, 2016. TIOBE Software BV. [Online; accessed 22-March-2016].

[38] J. Urbano, M. Marrero, and D. Martín. A comparison of the optimality of statistical significance tests for information retrieval evaluation. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, pages 925–928, 2013.

[39] E. M. Voorhees. The TREC-8 question answering track report. In Proceedings of the 8th Text REtrieval Conference, TREC '99, pages 77–82, 1999.

[40] S. Wang, D. Lo, and L. Jiang. Code search via topic-enriched dependence graph matching. In 18th Working Conference on Reverse Engineering (WCRE), pages 119–123, October 2011. doi: 10.1109/WCRE.2011.69.

[41] S. Wang, D. Lo, and L. Jiang. Active code search: Incorporating user feedback to improve code search relevance. In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, ASE '14, pages 677–682, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-3013-8. doi: 10.1145/2642937.2642947.

[42] X. Wang, D. Lo, J. Cheng, L. Zhang, H. Mei, and J. X. Yu. Matching dependence-related queries in the system dependence graph. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, ASE '10, pages 457–466, 2010.

[43] J. Yang and L. Tan. Inferring semantically related words from software context. In IEEE Working Conference on Mining Software Repositories (MSR), pages 161–170, 2012.

[44] R. K. Yin. Case Study Research: Design and Methods. SAGE Publications, 3rd edition, 2002.

[45] H. Zaragoza, N. Craswell, M. J. Taylor, S. Saria, and S. E. Robertson. Microsoft Cambridge at TREC 13: Web and HARD tracks. In Proceedings of the Thirteenth Text REtrieval Conference, TREC 2004, 2004.

[46] J. Zobel. How reliable are the results of large-scale information retrieval experiments? In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '98, pages 307–314, 1998.

Feng Zhang is currently a postdoctoral research fellow with the Department of Electrical and Computer Engineering at Queen's University in Canada. He obtained his PhD degree in Computer Science from Queen's University in 2016. His research interests include empirical software engineering, mining software repositories, and software analytics. His research has been published at several top-tier software engineering venues, such as the IEEE Transactions on Software Engineering (TSE), the International Conference on Software Engineering (ICSE), and the Springer Journal of Empirical Software Engineering (EMSE). More about Feng and his work is available at http://www.feng-zhang.com

Haoran Niu received the BEng degree from Harbin Institute of Technology, China, in 2013. She is currently working towards the M.A.Sc degree in the Department of Electrical and Computer Engineering at Queen's University, Canada. Her research interests are code search, code recommendation, and their applications in mobile application development.

Iman Keivanloo received his Ph.D. degree in 2013 from Concordia University, Canada. He is currently a Post Doctoral Fellow in the Department of Electrical and Computer Engineering at Queen's University. His research interests include source code similarity search, clone detection, source code recommendation, and their applications for software evolution and maintenance.

Ying Zou is the Canada Research Chair in Software Evolution. She is an associate professor in the Department of Electrical and Computer Engineering, and is cross-appointed to the School of Computing at Queen's University in Canada. She is a visiting scientist of IBM Centers for Advanced Studies, IBM Canada. Her research interests include software engineering, software reengineering, software reverse engineering, software maintenance, and service-oriented architecture. More about Ying and her work is available online at http://post.queensu.ca/~zouy