DESCRIPTION

This is the first presentation of the OpenMRS GSoC 2010 "Bug Analytics" project.

TRANSCRIPT

Page 1

OpenMRS GSoC 2010

First presentation

Shivashankar S

Project: Bug Analytics

Mentors: Dr. Diederik van Liere, David Eaves

About myself

Education: MS by Research in CSE

Grad school: Indian Institute of Technology Madras, Chennai, India

Research Area: Text Mining, Machine Learning

Page 2

Roadmap

• Overview

• Deliverables

• Duplicate report identification

• Expert identification

• Current progress & Challenges

• Results

Page 3

Project Overview

• Fixing bugs quickly is crucial to keeping any open-source project active and alive.

• High-level flow: when a new report arrives, it is assigned to an expert; the expert then resolves the bug and fixes it.

• Issue 1: organizations rely on a triaging team to manually assign reports to experts.

• Issue 2: if a report is resolved as a duplicate, the time the expert spent on it goes to waste.

• The aim of the "Bug Analytics" project is to address the above issues using the text in the reports.

• The bug-tracking tool of choice is JIRA.

Page 4

Deliverables

• Plug-in for JIRA that can do the following

– Duplicate ticket identification

– Automatic assignment of reports to experts

– Classification of a report as a bug or not

– Estimation of the likelihood that a bug report will be fixed

Note: since each task is treated individually in the literature, they will be tackled in the order listed above [the last two depend on time availability].

Page 5

Duplicate report identification

Semi-automated approach

• Predict the top "K" similar reports for each report and leave it to the administrator to decide whether it is a duplicate.

• Pros: no false alarms; the similar reports returned can also be used to improve the report description, analyze similar bugs, come up with a fix, etc.

• Cons: more human intervention than the automated approach.

• Reference: Lyndon Hiew, Gail C. Murphy. Assisted Detection of Duplicate Bug Reports. Submitted to FSE 2006.

Automated approach

• Fix a similarity threshold and flag a report as a duplicate if its similarity with any report in the DB exceeds the threshold.

• Pros: less human intervention.

• Cons: false alarms.

• Reference: Per Runeson, Magnus Alexandersson, Oskar Nyholm. Detection of Duplicate Defect Reports Using Natural Language Processing. In Proc. 29th International Conference on Software Engineering (ICSE 2007), pp. 499–510, May 20–26, 2007.

We will decide on the final approach based on the experimental results.

Page 6

Expert identification

Semi-automated approach

• Predict the top "K" experts for each report and leave it to the administrator, or to the top "K" experts themselves, to assign the report to one of them.

• Pros: even if one expert is busy, another can take the report up.

• Cons: some protocol or mechanism must be put in place to pick one of the "K" experts.

Automated approach

• Assign the report to exactly one expert.

• Pros: if the prediction is good, it requires zero manual effort.

• Cons: if the prediction is wrong, or the assigned person is overloaded, manual triaging is still required.

References:
[1] Anvik, J., Hiew, L., and Murphy, G. C. 2006. Who Should Fix This Bug? In Proc. ICSE.
[2] Anvik, J. and Murphy, G. C. 2007. Determining Implementation Expertise from Bug Reports. In Proc. Fourth International Workshop on Mining Software Repositories.

We will decide on the final approach based on the experimental results.

Page 7

Training set

• Duplicate report prediction: this task needs a validation set, both to fix the threshold in the automated approach and for evaluation in both the automated and semi-automated cases. We built it using the resolution field (resolution = "Duplicate").

• Expert prediction: here, building the training set is not straightforward, since the "assigned-to" field is not an exact indicator of the expert for a report.

– So a set of heuristics is employed, following http://www.cs.ubc.ca/labs/spl/projects/bugTriage/assignment/heuristics.html (used there for other projects). The heuristics are shown on the following slide.

Page 8

[Flowchart: heuristics for labelling reports with experts (sketched in code below)]

• Status Open or Reopened: if the report is assigned, label it with its owner; otherwise discard the report.

• Status Resolved or Closed: branch on the resolution:

– Fixed, with patch submissions or comment activity: add the person who submitted the most patches (else the one who commented the most) as the primary expert; add other patch submitters, commenters, and the owner as additional labels.

– Duplicate: use the labels of the duplicated report.

– Won't fix, Incomplete, or Cannot reproduce: add the resolver as the primary expert.
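A minimal sketch of these labelling heuristics in Python. The field names (status, resolution, assignee, patch_submitters, commenters, resolver, duplicate_of) are hypothetical stand-ins for the corresponding JIRA fields, not the actual schema:

```python
from collections import Counter

def label_report(report, labels_by_id):
    """Apply the labelling heuristics to one report dict.

    Returns (primary_expert, additional_labels) or None to discard.
    Assumes duplicates are processed after the reports they point to,
    so labels_by_id already holds the duplicated report's labels.
    """
    if report["status"] in ("Open", "Reopened"):
        # Keep only assigned open reports, labelled with their owner.
        owner = report["assignee"]
        return (owner, set()) if owner else None

    # Resolved / Closed: branch on the resolution field.
    resolution = report["resolution"]
    if resolution == "Duplicate":
        return labels_by_id.get(report["duplicate_of"])
    if resolution in ("Won't fix", "Incomplete", "Cannot reproduce"):
        return (report["resolver"], set())
    if resolution == "Fixed":
        # Prefer patch submitters; fall back to commenters.
        activity = report["patch_submitters"] or report["commenters"]
        if not activity:
            return None
        primary = Counter(activity).most_common(1)[0][0]
        extras = set(report["patch_submitters"]) | set(report["commenters"])
        if report["assignee"]:
            extras.add(report["assignee"])
        extras.discard(primary)
        return (primary, extras)
    return None
```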

Page 9

Current Progress

• Working code that performs duplicate report identification and expert identification, with results comparable to state-of-the-art approaches.

• The next step is to improve the results through closer analysis and to work on the JIRA plug-in.

Challenges

• Noisy text: abbreviations and spelling mistakes.

• Making proper use of stack traces and log information.

• Varying terminology, since different reporters do not necessarily use the same word for the same concept.

Page 10

Duplicate identification

• TF and TF-IDF vectors are constructed from the Summary, Description, and Comments text. The two configurations are referred to as SDC (with comments) and SD (without comments).

• In the semi-automated approach, the top "K" similar reports are returned for each report (see the sketch after this list).

– The presence of the actual duplicated report in the top "K" counts as a hit.

– The results are plotted as "K" vs. hit ratio.

– From the results, TF-IDF on SD performs best.
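As a concrete illustration, a minimal sketch of this semi-automated evaluation, assuming scikit-learn for the TF-IDF vectors and cosine similarity (the slides do not specify the implementation); `texts` and `duplicate_of` are hypothetical inputs:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def topk_hit_ratio(texts, duplicate_of, k):
    """Fraction of duplicates whose original shows up in the top-K.

    texts: one string per report (e.g. summary + description for SD,
           plus comments for SDC)
    duplicate_of: dict mapping a duplicate's index to its original's index
    """
    vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
    sims = cosine_similarity(vectors)
    np.fill_diagonal(sims, -1.0)  # a report is never its own duplicate
    hits = 0
    for dup, orig in duplicate_of.items():
        top_k = np.argsort(sims[dup])[::-1][:k]
        hits += int(orig in top_k)
    return hits / len(duplicate_of)
```

Sweeping k and plotting the returned ratio reproduces the "K vs. hit ratio" curves.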

Page 11

Semi-automated approach

Page 12

Automated approach

• A report whose similarity with any other report in the DB exceeds the threshold is flagged as a duplicate.

• Otherwise it is called unique.

• For reports flagged as duplicates, the top "K" similar reports above the threshold are examined (see the sketch after this list).

– If the actual duplicated report is present in the top "K", it is a hit for the duplicate case.

• Conversely, if a report is correctly classified as unique, it is a hit for the unique case.

• Plots are drawn for threshold vs. hit ratio for both the duplicate and unique cases.
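A matching sketch for the automated evaluation at a single threshold, reusing the masked similarity matrix from the previous sketch; the helper name and inputs are again hypothetical:

```python
import numpy as np

def threshold_hit_ratios(sims, duplicate_of, threshold, k):
    """Hit ratios at one threshold for the duplicate and unique cases.

    sims: pairwise similarity matrix with the diagonal masked to -1
    duplicate_of: dict duplicate index -> original index; every other
                  report is treated as unique
    """
    dup_hits = dup_total = uniq_hits = uniq_total = 0
    for i in range(sims.shape[0]):
        flagged = sims[i].max() > threshold  # flagged as a duplicate
        if i in duplicate_of:
            dup_total += 1
            if flagged:
                # Top-K similar reports that also exceed the threshold.
                top_k = [j for j in np.argsort(sims[i])[::-1][:k]
                         if sims[i][j] > threshold]
                dup_hits += int(duplicate_of[i] in top_k)
        else:
            uniq_total += 1
            uniq_hits += int(not flagged)  # correctly kept as unique
    return dup_hits / max(dup_total, 1), uniq_hits / max(uniq_total, 1)
```

Sweeping the threshold then gives the "threshold vs. hit ratio" plots for both cases shown on the following slides.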

Page 13

Automated approach – SDC

Page 14

Automated approach – SD

Page 15

Expert classification

• The methods used are the following (see the sketch after this list):

– Maximum-likelihood prediction using BRKNN (binary relevance KNN)

– Maximum a posteriori (MAP) prediction using BRKNN

– Component-wise maximum-likelihood prediction using BRKNN

– Component-wise MAP using BRKNN (best results)

• For smaller "K" and a smaller number of experts returned, precision is high.

• Conversely, for larger "K" and a larger number of experts returned, recall is high.
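A minimal sketch of the plain maximum-likelihood BRKNN variant, plus the per-report precision/recall used in the plots; it assumes scikit-learn TF-IDF features, and the MAP and component-wise variants listed above are not shown:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def brknn_predict(train_texts, train_labels, query_text, k, n_experts):
    """Score each expert by how often they label the query's k nearest
    training reports (binary relevance KNN), and return the top-N.

    train_labels: one set of expert names per training report.
    """
    vectorizer = TfidfVectorizer(stop_words="english").fit(train_texts)
    train_vecs = vectorizer.transform(train_texts)
    query_vec = vectorizer.transform([query_text])
    sims = cosine_similarity(query_vec, train_vecs)[0]
    neighbors = np.argsort(sims)[::-1][:k]
    scores = {}
    for i in neighbors:
        for expert in train_labels[i]:
            scores[expert] = scores.get(expert, 0) + 1
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_experts]

def precision_recall(predicted, actual):
    """Precision and recall of a predicted expert set for one report."""
    true_positives = len(set(predicted) & set(actual))
    return true_positives / len(predicted), true_positives / len(actual)
```

Returning fewer experts (small N) tends to raise precision, while larger K and N raise recall, matching the observation above.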

Page 16

Precision value for returning 1 expert

Page 17

Recall value for returning 1 expert

Page 18

Precision value for returning 2 experts

Page 19

Recall value for returning 2 experts

Page 20

Precision value for returning 3 experts

Page 21

Recall value for returning 3 experts