
  • API Linking in Informal Technical Discussion

    CHENG CHEN (U5969643)

    SUPERVISED BY ZHENCHANG XING

    Australian National University / Data61

  • Background

    In programming discussion platforms (e.g., Stack Overflow, Twitter), many APIs are mentioned in natural-language contexts, for example:

    • How to apply pos_tag_sents() to a pandas dataframe efficiently?

    • The documentation describes support for the apply() method, but it doesn't accept any arguments

  • Challenges

    1. Common-word polysemy:

    E.g. 55.04% of Pandas's APIs have a common-word simple name, such as:

    ◦ Series: a class name ◦ apply: a method name

    Therefore, in this sentence: "I want to apply a function with arguments to a series in python pandas. The documentation describes support for the apply method, but it doesn't accept any arguments."

    It is hard to recognize whether "apply" is the general English verb or the name of a Pandas function.

  • Challenges

    2. Sentence-format and sentence-context variations:

    I have finally decided to use apply which I understand is more flexible.

    if you run apply on a series the series is passed as a np.array

    It is being run on each row of a Pandas DataFrame via the apply function

    Because of this variation, we cannot simply develop a complete set of regular expressions or an island grammar checker to recognize API mentions.

  • Challenges

    3. The variety of API mention forms:

    a) Variations in API mention form are caused by different presentation styles, coding styles, abbreviation methods and contexts

    b) Accidental factors such as misspellings, inconsistent annotations and spacing

  • Idea of solutions

    1. Some character-level features are commonly shared by many API mentions, for example:

    Matplotlib.pyplot.savefig(path = '')

    Case sensitivity, module names, brackets, parameter list (see the sketch below)
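    A minimal sketch of how such character-level cues could be checked heuristically; the feature names and regular expressions below are illustrative assumptions, not the actual features learned by the model:

    import re

    # Illustrative character-level cues for a candidate API-mention token.
    # These patterns are assumptions for demonstration only.
    def char_level_cues(token):
        return {
            "has_camel_case": bool(re.search(r"[a-z][A-Z]", token)),   # e.g. DataFrame
            "has_module_path": "." in token,                            # e.g. matplotlib.pyplot.savefig
            "has_brackets": "(" in token and ")" in token,              # e.g. savefig(...)
            "has_parameters": bool(re.search(r"\(\s*\w+\s*=", token)),  # e.g. savefig(path='')
            "has_underscore": "_" in token,                             # e.g. pos_tag_sents
        }

    print(char_level_cues("Matplotlib.pyplot.savefig(path='')"))
    # {'has_camel_case': False, 'has_module_path': True, 'has_brackets': True, 'has_parameters': True, 'has_underscore': False}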

  • Idea of solutions

    2. The sentence-context information is helpful for distinguishing API mentions from non-API words such as verbs, common nouns, and software-programming jargon

    Some example posts:

    It is being run on each row of a Pandas DataFrame via the apply function

    if you run apply on a series the series is passed as a np.array

  • Workflow

    [Workflow diagram: Stack Overflow discussions (raw data) → Tokenizer → Data Selector → manually labelled dataset → DNN model training → Transfer learning]

  • Preparation of the training data

    1. The tokenizer is developed based on a Twitter tokenizer, with special rules to keep the API mention structure intact (a sketch follows this slide).

    For the input "matplotlib.pyplot.imshow()":

    A general English tokenizer's output: "Matplotlib.Pyplot." and "Imshow()"

    This work's tokenizer's output: "matplotlib.pyplot.imshow()"

    2. Manually labelled 1500 posts (including 3722 API mentions, over 5000 sentences)
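    A minimal sketch of a tokenizer with such a special rule, assuming a regular expression that protects dotted API calls before falling back to plain word splitting; this is an illustration, not the exact tokenizer used in this work:

    import re

    # Pattern for an API-like token: a dotted name and/or a call with a parameter list,
    # e.g. matplotlib.pyplot.imshow() or np.array. The pattern is an assumption for illustration.
    API_PATTERN = r"[A-Za-z_][\w.]*\w\([^)]*\)|[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+"

    def tokenize(sentence):
        # Try API-like tokens first so they are kept intact,
        # then fall back to ordinary words and punctuation.
        pattern = re.compile(API_PATTERN + r"|\w+|[^\w\s]")
        return pattern.findall(sentence)

    print(tokenize("you can use matplotlib.pyplot.imshow() to show the image"))
    # ['you', 'can', 'use', 'matplotlib.pyplot.imshow()', 'to', 'show', 'the', 'image']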

  • Neural Network Structure

    [Architecture diagram: input data → word embedding and char-level features (convolutional layer + max pooling layer) → sentence-level features (Bi-LSTM layer) → classifier of API mention]

  • Neural Network Structure

    1. The convolutional layer and max pooling layer are used to capture the character-level features (see the sketch below)

    Example word: s a v e t x t

    The matrix representing the word is built from the character vectors

    Max pooling keeps the most important values
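    A minimal PyTorch sketch of a character-level CNN with max pooling over a word such as "savetxt"; the framework choice, layer sizes and character vocabulary are illustrative assumptions, not the exact configuration of the model:

    import torch
    import torch.nn as nn

    # Illustrative sizes; the real model's hyper-parameters are not specified here.
    CHAR_VOCAB = 128      # e.g. ASCII characters
    CHAR_EMB_DIM = 16
    NUM_FILTERS = 32

    class CharCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.char_emb = nn.Embedding(CHAR_VOCAB, CHAR_EMB_DIM)
            # 1-D convolution over the character dimension of a word.
            self.conv = nn.Conv1d(CHAR_EMB_DIM, NUM_FILTERS, kernel_size=3, padding=1)

        def forward(self, char_ids):                  # (batch, word_len)
            x = self.char_emb(char_ids)               # (batch, word_len, emb_dim)
            x = x.transpose(1, 2)                     # (batch, emb_dim, word_len)
            x = torch.relu(self.conv(x))              # (batch, num_filters, word_len)
            # Max pooling over characters keeps the strongest response of each filter.
            return x.max(dim=2).values                # (batch, num_filters)

    word = "savetxt"
    char_ids = torch.tensor([[ord(c) for c in word]])  # shape (1, 7)
    print(CharCNN()(char_ids).shape)                   # torch.Size([1, 32])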

  • Neural Network Structure

    2. A bi-directional long short-term memory (Bi-LSTM) layer is used for learning the sentence-level information

    Forward input sequence: you can use apply() method

    Backward input sequence: you can use apply() method (read in reverse)

    Both outputs are concatenated as the abstract matrix of the input sentence (see the sketch below)
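    A minimal PyTorch sketch of a Bi-LSTM over a tokenized sentence, concatenating the forward and backward outputs; the embedding and hidden sizes are illustrative assumptions:

    import torch
    import torch.nn as nn

    WORD_EMB_DIM = 64     # illustrative size of each token representation (word embedding + char-CNN feature)
    HIDDEN_DIM = 50       # illustrative LSTM hidden size

    # bidirectional=True runs the sequence forwards and backwards;
    # the outputs of both directions are concatenated at every time step.
    bilstm = nn.LSTM(WORD_EMB_DIM, HIDDEN_DIM, batch_first=True, bidirectional=True)

    # One sentence of 5 tokens: "you can use apply() method"
    sentence = torch.randn(1, 5, WORD_EMB_DIM)         # (batch, seq_len, emb_dim)
    outputs, _ = bilstm(sentence)                       # (batch, seq_len, 2 * HIDDEN_DIM)
    print(outputs.shape)                                # torch.Size([1, 5, 100])

    # Each row of `outputs` can then be fed to a per-token classifier
    # that decides whether the token is an API mention.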

  • Feasibility of Transfer learning

    1. The training data sets are generated from the Python libraries NumPy, Pandas and Matplotlib.

    2. The data sets share some character-level and sentence-level features, so they have the potential for transfer learning.

    3. The convolutional layer captures character-level features and the Bi-LSTM layer learns sentence-level features; the weights of each layer are separately sharable.

    4. The weights of the neural network are loaded and frozen across different training tasks (a sketch follows this slide).
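    A minimal PyTorch sketch of this load-and-freeze step; the module names, sizes and file name are hypothetical stand-ins for the actual model, and freezing the char-CNN here is only one possible choice:

    import torch
    import torch.nn as nn

    # Toy stand-in for the full model: a char-level CNN part, a Bi-LSTM part and a classifier.
    class APIMentionModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.char_cnn = nn.Conv1d(16, 32, kernel_size=3, padding=1)
            self.bilstm = nn.LSTM(64, 50, batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(100, 2)

    # Pretend this architecture was already trained on Matplotlib data and saved:
    torch.save(APIMentionModel().state_dict(), "model_matplotlib.pt")

    # Transfer: load the pre-trained weights into a fresh model ...
    model = APIMentionModel()
    model.load_state_dict(torch.load("model_matplotlib.pt"))

    # ... and freeze the character-level CNN so its weights are shared rather than re-trained.
    for param in model.char_cnn.parameters():
        param.requires_grad = False

    # Only the remaining (unfrozen) parameters are updated on the new library's data.
    optimizer = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)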

  • Evaluation Methods

    precision: the percentage of retrieved API mentions that are true API mentions

    recall: the percentage of the true API mentions that are retrieved

    F1 score is defined as: F1 = 2 · precision · recall / (precision + recall)

    A worked example follows this slide.
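    A small worked example of these metrics; the counts below are made up for demonstration and are not from this evaluation:

    # Illustrative counts, not from the paper's evaluation.
    true_positives = 80    # API mentions correctly extracted
    false_positives = 20   # non-API tokens wrongly extracted as API mentions
    false_negatives = 25   # API mentions that were missed

    precision = true_positives / (true_positives + false_positives)    # 0.800
    recall = true_positives / (true_positives + false_negatives)       # ~0.762
    f1 = 2 * precision * recall / (precision + recall)                 # ~0.780

    print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")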

  • Performance of API extraction

    The F1 scores of the API extraction results:

                       Matplotlib   Numpy   Pandas
    Word-based model   75.71        72.81   81.53
    Char-based model   75.42        71.27   77.45
    Deep model         78.98        76.30   84.60

    The general performance improvement of the deep model is around 4%

  • Results of Transfer learning for Numpy data

    [Chart: F1 score (45–70) against training data size (25%, 50%, 100%) for three settings: randomly initialized, model pre-trained on Pandas data, and model pre-trained on Matplotlib data]

  • Results of Transfer learning for Pandas data

    [Chart: F1 score (60–85) against training data size (25%, 50%, 100%) for three settings: randomly initialized, model pre-trained on Numpy data, and model pre-trained on Matplotlib data]

  • Conclusion

    1. Our work achieves acceptable results on the API mention linking task

    2. Transfer learning generally improves the model's performance

    3. Transfer learning improves the neural network training, and the improvement is more obvious as the training data set becomes smaller

    4. The Bi-LSTM weights have less transfer potential than the lower CNN layer, because they are influenced by the output of the CNN layer
