-
API Linking in Informal Technical Discussion
CHENG CHEN (U5969643)
SUPERVISED BY ZHENCHANG XING
Australian National University / Data61
-
Background
In programming discussion platforms (e.g., Stack Overflow, Twitter), many APIs are mentioned in natural-language contexts, for example:
• How to apply pos_tag_sents() to a pandas dataframe efficiently?
• The documentation describes support for the apply() method, but it doesn't accept any arguments
-
Challenges
1. Common-word polysemy
E.g., 55.04% of Pandas's APIs have common-word simple names, such as:
◦ Series: a class name
◦ apply: a method name
Therefore, in this sentence: "I want to apply a function with arguments to a series in python pandas. The documentation describes support for the apply method, but it doesn't accept any arguments."
it is hard to recognize whether each "apply" is a general verb or the name of a Pandas function.
-
Challenges
2. Sentence-format and sentence-context variations
I have finally decided to use apply which I understand is more flexible.
if you run apply on a series the series is passed as a np.array
It is being run on each row of a Pandas DataFrame via the apply function
We cannot simply develop a complete set of regular expressions or an island-grammar checker to recognize API mentions.
-
Challenges
3. The variety of API mention forms
a) The variety of API forms is caused by different presentation styles, coding styles, abbreviation methods, and contexts.
b) Some accidental factors, like misspellings, inconsistent annotations, and spacing.
-
Idea of Solutions
1. Some character-level features are commonly shared by many API mentions:
Matplotlib.pyplot.savefig(path = '')
(case sensitivity, module names, brackets, parameter list)
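These shared character-level cues can be made concrete with a small rule-based sketch. This is only an illustration of the kinds of features involved, not the features the neural model actually learns; the feature names are my own.

```python
import re

def char_level_features(token):
    """Extract simple character-level cues that many API mentions share.
    The feature set is illustrative, not the model's learned features."""
    return {
        # Mixed case, e.g. "DataFrame" (case sensitivity)
        "has_case_mix": token != token.lower() and token != token.upper(),
        # Dotted module path, e.g. "matplotlib.pyplot.savefig"
        "has_module_path": "." in token.strip("."),
        # Call brackets, possibly enclosing a parameter list
        "has_brackets": bool(re.search(r"\(.*\)", token)),
        # Underscore-joined name, e.g. "pos_tag_sents"
        "has_underscore": "_" in token,
    }

feats = char_level_features("Matplotlib.pyplot.savefig(path='')")
```

A plain verb such as "apply" triggers none of these cues, which is why character-level evidence alone is not enough and sentence context is also needed.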
-
Idea of Solutions
2. The sentence-context information helps distinguish API mentions from non-API words, like verbs, some nouns, and software-programming jargon.
Some example posts:
It is being run on each row of a Pandas DataFrame via the apply function
if you run apply on a series the series is passed as a np.array
-
Workflow
[Diagram: Stack Overflow discussion (raw data) → Tokenizer → Data Selecter → manually labelled data → dataset → DNN model training → transfer learning]
-
Preparation of the Training Data
1. The tokenizer is developed based on a Twitter tokenizer, with special rules to keep the API mention structure.
A general English tokenizer's output: "Matplotlib.Pyplot." / "Imshow()"
This work's tokenizer's output: "matplotlib.pyplot.imshow()"
2. Manually labelled 1,500 posts (including 3,722 API mentions, over 5,000 sentences)
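The special rule can be sketched as a regex that tries to match a whole API mention before falling back to ordinary word splitting. This is a simplification of the Twitter-tokenizer-based implementation; the exact pattern is an assumption.

```python
import re

# Match a dotted, possibly called identifier as ONE token, e.g.
# "matplotlib.pyplot.imshow()"; otherwise fall back to single
# non-space characters (a stand-in for ordinary tokenization rules).
API_PATTERN = r"[A-Za-z_][\w.]*(?:\([^)]*\))?"
TOKEN_RE = re.compile(API_PATTERN + r"|\S")

def tokenize(sentence):
    return TOKEN_RE.findall(sentence)

tokens = tokenize("use matplotlib.pyplot.imshow() to show the image")
# "matplotlib.pyplot.imshow()" stays intact instead of being split
```

Keeping the mention intact matters because the downstream character-level CNN needs the full dotted-and-bracketed form as one input word.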
-
Neural Network Structure
[Diagram: input data → word embedding → convolutional layer → max-pooling layer (char-level feature) → Bi-LSTM layer (sentence-level feature) → classifier of API mention]
-
Neural Network Structure
1. The convolutional layer and max-pooling layer are used to capture the character-level features.
[Figure: a matrix represents the word "savetxt" based on character vectors; max pooling keeps the most important value]
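The convolution-then-pool step can be sketched in pure Python as follows. The dimensions, the hash-based character embedding, and the single filter are all illustrative stand-ins, not the trained model's parameters.

```python
# Toy character-level convolution + max-over-time pooling.
CHAR_DIM = 4  # size of each character vector (illustrative)

def char_embed(word):
    """Map each character to a small fixed vector: a hash-based
    stand-in for a learned character embedding table."""
    return [[((ord(c) * (i + 1)) % 7) / 7.0 for i in range(CHAR_DIM)]
            for c in word]

def conv_max_pool(char_vectors, kernel, width=3):
    """Slide a kernel over `width`-character windows of the matrix,
    then keep only the maximum response (the 'most important value')."""
    responses = []
    for start in range(len(char_vectors) - width + 1):
        window = char_vectors[start:start + width]
        resp = sum(x * w
                   for row, krow in zip(window, kernel)
                   for x, w in zip(row, krow))
        responses.append(resp)
    return max(responses)

kernel = [[0.1] * CHAR_DIM for _ in range(3)]  # one 3-character filter
feature = conv_max_pool(char_embed("savetxt"), kernel)
```

In the real model many filters run in parallel, each yielding one pooled value, so a word like "savetxt" becomes a fixed-length character-level feature vector regardless of its length.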
-
Neural Network Structure
2. A bi-directional long short-term memory (Bi-LSTM) layer is used to learn the sentence-level information.
Forward input sequence: you can use apply() method
Backward input sequence: you can use apply() method (read right to left)
Concatenate both outputs as the abstract matrix of the input sentence.
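The bidirectional reading can be sketched with a stand-in recurrent step. A real LSTM cell has input, forget, and output gates; this toy update only shows the forward pass, the backward pass over the reversed sequence, and the per-position concatenation.

```python
def rnn_pass(vectors, state=0.0):
    """Toy recurrent step standing in for an LSTM cell: each output
    mixes the current input with the running state."""
    outputs = []
    for x in vectors:
        state = 0.5 * state + 0.5 * x
        outputs.append(state)
    return outputs

def bilstm_like(vectors):
    forward = rnn_pass(vectors)               # left-to-right reading
    backward = rnn_pass(vectors[::-1])[::-1]  # right-to-left, realigned
    # Concatenate both directions per position as the sentence matrix
    return [(f, b) for f, b in zip(forward, backward)]

# e.g. one scalar "embedding" per token of "you can use apply() method"
sentence = [0.2, 0.4, 0.1, 0.9, 0.3]
matrix = bilstm_like(sentence)
```

Because each position's pair combines left context (forward state) and right context (backward state), the classifier can see the whole sentence when deciding whether a token like "apply" is an API mention.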
-
Feasibility of Transfer Learning
1. The training data sets are generated from the Python libraries Numpy, Pandas, and Matplotlib.
2. The data share some character-level and sentence-level features, so they have the potential for transfer.
3. The convolutional layer captures character-level features, and the Bi-LSTM layer learns sentence-level features; the weights of each layer are separately sharable.
4. The weights of the neural network are loaded and frozen across different training tasks.
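The load-and-freeze scheme can be sketched in pure Python. The layer names, weight values, and update rule are illustrative, not the actual framework code.

```python
# Toy sketch of cross-library weight transfer: load the layers trained
# on a source library, then freeze some of them for the target task.
def transfer(source_model, freeze_layers):
    target_model = dict(source_model)  # load the pre-trained weights
    frozen = set(freeze_layers)        # these get no gradient updates
    def update(layer, delta):
        if layer not in frozen:
            target_model[layer] += delta
        return target_model[layer]
    return target_model, update

pandas_model = {"conv": 1.0, "bilstm": 2.0, "classifier": 3.0}
numpy_model, update = transfer(pandas_model,
                               freeze_layers=["conv", "bilstm"])
update("conv", 0.5)        # ignored: the conv layer is frozen
update("classifier", 0.5)  # applied: the classifier is fine-tuned
```

Because the conv and Bi-LSTM weights are separately sharable, either layer can be frozen alone, which is what makes the per-layer transfer comparisons in the results possible.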
-
Evaluation Methods
Precision: the fraction of retrieved results that are positive.
Recall: the fraction of all positive results that are retrieved.
The F1 score is defined as:
F1 = 2 × precision × recall / (precision + recall)
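As a quick check, precision, recall, and F1 can be computed from raw counts; the counts in the usage example are hypothetical.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and their harmonic mean (F1)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 80 correctly extracted mentions, 20 spurious, 20 missed
precision, recall, f1 = f1_score(80, 20, 20)
# precision and recall are both 0.8
```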
-
Performance of API Extraction
                   Matplotlib   Numpy   Pandas
Word-based model        75.71   72.81    81.53
Char-based model        75.42   71.27    77.45
Deep model              78.98   76.30    84.60
The F1 scores of the API extraction results. The deep model's overall improvement is about 4%.
-
Results of Transfer Learning for Numpy Data
[Chart: F1 score (roughly 45-70) vs. training data size (25%, 50%, 100%), comparing randomly initialized weights with models trained on Pandas data and on Matplotlib data]
-
Results of Transfer Learning for Pandas Data
[Chart: F1 score (roughly 60-85) vs. training data size (25%, 50%, 100%), comparing randomly initialized weights with models trained on Numpy data and on Matplotlib data]
-
Conclusion
1. Our work achieves acceptable results on the API mention linking task.
2. Transfer learning generally improves the model's performance.
3. The transfer learning method improves neural network training, and the improvement is more obvious as the training data set becomes smaller.
4. The Bi-LSTM weights have less transfer potential than the lower CNN layer, because they are influenced by the output of the CNN layer.