
  • API Linking in Informal Technical Discussion

    CHENG CHEN (U5969643)

    SUPERVISED BY ZHENCHANG XING

    Australian National University / Data61

  • Background

    In programming discussion platforms (e.g., Stack Overflow, Twitter), many APIs are mentioned in natural-language contexts, for example:

    • How to apply pos_tag_sents() to a pandas dataframe efficiently?

    • The documentation describes support for the apply() method, but it doesn't accept any arguments

  • Challenges

    1. Common-word polysemy:

    E.g. 55.04% of Pandas's APIs have a common-word simple name, such as:

    ◦ Series: a class name ◦ apply: a method name

    Therefore, in this sentence: "I want to apply a function with arguments to a series in python pandas. The documentation describes support for the apply method, but it doesn't accept any arguments."

    It is hard to recognize whether "apply" is the general English verb or the name of a Pandas function.

  • Challenges

    2. Sentence-format and sentence-context variations:

    I have finally decided to use apply which I understand is more flexible.

    if you run apply on a series the series is passed as a np.array

    It is being run on each row of a Pandas DataFrame via the apply function

    Because of this variation, we cannot simply develop a complete set of regular expressions or an island grammar checker to recognize API mentions.

  • Challenges

    3. The variety of API mention forms:

    a) Variations in API mention form are caused by different presentation styles, coding styles, abbreviation methods and contexts

    b) Accidental factors such as misspellings, inconsistent annotations and spacing

  • Idea of solutions

    1. Some character-level features are commonly shared by many API mentions, for example:

    Matplotlib.pyplot.savefig(path = '')

    Case sensitivity, module names, brackets, parameter list (see the sketch below)
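    A minimal sketch of how such character-level cues could be checked heuristically; the feature names and regular expressions below are illustrative assumptions, not the actual features learned by the model:

    import re

    # Illustrative character-level cues for a candidate API-mention token.
    # These patterns are assumptions for demonstration only.
    def char_level_cues(token):
        return {
            "has_camel_case": bool(re.search(r"[a-z][A-Z]", token)),   # e.g. DataFrame
            "has_module_path": "." in token,                            # e.g. matplotlib.pyplot.savefig
            "has_brackets": "(" in token and ")" in token,              # e.g. savefig(...)
            "has_parameters": bool(re.search(r"\(\s*\w+\s*=", token)),  # e.g. savefig(path='')
            "has_underscore": "_" in token,                             # e.g. pos_tag_sents
        }

    print(char_level_cues("Matplotlib.pyplot.savefig(path='')"))
    # {'has_camel_case': False, 'has_module_path': True, 'has_brackets': True, 'has_parameters': True, 'has_underscore': False}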

  • Idea of solutions

    2. The sentence-context information is helpful for distinguishing API mentions from non-API words such as verbs, common nouns, and software-programming jargon

    Some example posts:

    It is being run on each row of a Pandas DataFrame via the apply function

    if you run apply on a series the series is passed as a np.array

  • Workflow

    [Workflow diagram: Stack Overflow discussions (raw data) → Tokenizer → Data Selector → manually labelled dataset → DNN model training → Transfer learning]

  • Preparation of the training data

    1. The tokenizer is developed based on a Twitter tokenizer, with special rules to keep the API mention structure intact (a sketch follows this slide).

    For the input "matplotlib.pyplot.imshow()":

    A general English tokenizer's output: "Matplotlib.Pyplot." and "Imshow()"

    This work's tokenizer's output: "matplotlib.pyplot.imshow()"

    2. Manually labelled 1500 posts (including 3722 API mentions, over 5000 sentences)
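    A minimal sketch of a tokenizer with such a special rule, assuming a regular expression that protects dotted API calls before falling back to plain word splitting; this is an illustration, not the exact tokenizer used in this work:

    import re

    # Pattern for an API-like token: a dotted name and/or a call with a parameter list,
    # e.g. matplotlib.pyplot.imshow() or np.array. The pattern is an assumption for illustration.
    API_PATTERN = r"[A-Za-z_][\w.]*\w\([^)]*\)|[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+"

    def tokenize(sentence):
        # Try API-like tokens first so they are kept intact,
        # then fall back to ordinary words and punctuation.
        pattern = re.compile(API_PATTERN + r"|\w+|[^\w\s]")
        return pattern.findall(sentence)

    print(tokenize("you can use matplotlib.pyplot.imshow() to show the image"))
    # ['you', 'can', 'use', 'matplotlib.pyplot.imshow()', 'to', 'show', 'the', 'image']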

  • Neural Network Structure

    [Architecture diagram: input data → word embedding and char-level features (convolutional layer + max pooling layer) → sentence-level features (Bi-LSTM layer) → classifier of API mention]

  • Neural Network Structure

    1. The convolutional layer and max pooling layer are used to capture the character-level features (see the sketch below)

    Example word: s a v e t x t

    The matrix representing the word is built from the character vectors

    Max pooling keeps the most important values
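    A minimal PyTorch sketch of a character-level CNN with max pooling over a word such as "savetxt"; the framework choice, layer sizes and character vocabulary are illustrative assumptions, not the exact configuration of the model:

    import torch
    import torch.nn as nn

    # Illustrative sizes; the real model's hyper-parameters are not specified here.
    CHAR_VOCAB = 128      # e.g. ASCII characters
    CHAR_EMB_DIM = 16
    NUM_FILTERS = 32

    class CharCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.char_emb = nn.Embedding(CHAR_VOCAB, CHAR_EMB_DIM)
            # 1-D convolution over the character dimension of a word.
            self.conv = nn.Conv1d(CHAR_EMB_DIM, NUM_FILTERS, kernel_size=3, padding=1)

        def forward(self, char_ids):                  # (batch, word_len)
            x = self.char_emb(char_ids)               # (batch, word_len, emb_dim)
            x = x.transpose(1, 2)                     # (batch, emb_dim, word_len)
            x = torch.relu(self.conv(x))              # (batch, num_filters, word_len)
            # Max pooling over characters keeps the strongest response of each filter.
            return x.max(dim=2).values                # (batch, num_filters)

    word = "savetxt"
    char_ids = torch.tensor([[ord(c) for c in word]])  # shape (1, 7)
    print(CharCNN()(char_ids).shape)                   # torch.Size([1, 32])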

  • Neural Network Structure

    2. A bi-directional long short-term memory (Bi-LSTM) layer is used for learning the sentence-level information

    Forward input sequence: you can use apply() method

    Backward input sequence: you can use apply() method (read in reverse)

    Both outputs are concatenated as the abstract matrix of the input sentence (see the sketch below)
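    A minimal PyTorch sketch of a Bi-LSTM over a tokenized sentence, concatenating the forward and backward outputs; the embedding and hidden sizes are illustrative assumptions:

    import torch
    import torch.nn as nn

    WORD_EMB_DIM = 64     # illustrative size of each token representation (word embedding + char-CNN feature)
    HIDDEN_DIM = 50       # illustrative LSTM hidden size

    # bidirectional=True runs the sequence forwards and backwards;
    # the outputs of both directions are concatenated at every time step.
    bilstm = nn.LSTM(WORD_EMB_DIM, HIDDEN_DIM, batch_first=True, bidirectional=True)

    # One sentence of 5 tokens: "you can use apply() method"
    sentence = torch.randn(1, 5, WORD_EMB_DIM)         # (batch, seq_len, emb_dim)
    outputs, _ = bilstm(sentence)                       # (batch, seq_len, 2 * HIDDEN_DIM)
    print(outputs.shape)                                # torch.Size([1, 5, 100])

    # Each row of `outputs` can then be fed to a per-token classifier
    # that decides whether the token is an API mention.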

  • Feasibility of Transfer learning

    1. The training data sets are generated from the Python libraries NumPy, Pandas and Matplotlib.

    2. The data sets share some character-level and sentence-level features, so they have the potential for transfer learning.

    3. The convolutional layer captures character-level features and the Bi-LSTM layer learns sentence-level features; the weights of each layer are separately sharable.

    4. The weights of the neural network are loaded and frozen across different training tasks (a sketch follows this slide).
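    A minimal PyTorch sketch of this load-and-freeze step; the module names, sizes and file name are hypothetical stand-ins for the actual model, and freezing the char-CNN here is only one possible choice:

    import torch
    import torch.nn as nn

    # Toy stand-in for the full model: a char-level CNN part, a Bi-LSTM part and a classifier.
    class APIMentionModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.char_cnn = nn.Conv1d(16, 32, kernel_size=3, padding=1)
            self.bilstm = nn.LSTM(64, 50, batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(100, 2)

    # Pretend this architecture was already trained on Matplotlib data and saved:
    torch.save(APIMentionModel().state_dict(), "model_matplotlib.pt")

    # Transfer: load the pre-trained weights into a fresh model ...
    model = APIMentionModel()
    model.load_state_dict(torch.load("model_matplotlib.pt"))

    # ... and freeze the character-level CNN so its weights are shared rather than re-trained.
    for param in model.char_cnn.parameters():
        param.requires_grad = False

    # Only the remaining (unfrozen) parameters are updated on the new library's data.
    optimizer = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)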

  • Evaluation Methods

    precision: the percentage of retrieved API mentions that are true API mentions

    recall: the percentage of the true API mentions that are retrieved

    F1 score is defined as: F1 = 2 · precision · recall / (precision + recall)

    A worked example follows this slide.
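    A small worked example of these metrics; the counts below are made up for demonstration and are not from this evaluation:

    # Illustrative counts, not from the paper's evaluation.
    true_positives = 80    # API mentions correctly extracted
    false_positives = 20   # non-API tokens wrongly extracted as API mentions
    false_negatives = 25   # API mentions that were missed

    precision = true_positives / (true_positives + false_positives)    # 0.800
    recall = true_positives / (true_positives + false_negatives)       # ~0.762
    f1 = 2 * precision * recall / (precision + recall)                 # ~0.780

    print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")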

  • Performance of API extraction

    The F1 scores of the API extraction results:

                       Matplotlib   Numpy   Pandas
    Word-based model   75.71        72.81   81.53
    Char-based model   75.42        71.27   77.45
    Deep model         78.98        76.30   84.60

    The general performance improvement of the deep model is around 4%

  • Results of Transfer learning for Numpy data

    [Chart: F1 score (45–70) against training data size (25%, 50%, 100%) for three settings: randomly initialized, model pre-trained on Pandas data, and model pre-trained on Matplotlib data]

  • Results of Transfer learning for Pandas data

    [Chart: F1 score (60–85) against training data size (25%, 50%, 100%) for three settings: randomly initialized, model pre-trained on Numpy data, and model pre-trained on Matplotlib data]

  • Conclusion

    1. Our work achieves acceptable results on the API mention linking task

    2. Transfer learning generally improves the model's performance

    3. Transfer learning improves the neural network training, and the improvement is more obvious as the training data set becomes smaller

    4. The Bi-LSTM weights have less transfer potential than the lower CNN layer, because they are influenced by the output of the CNN layer
