Using Natural Language Processing on large Github data

- John Alexander and Harshitha Chidananda, CS273: Data and Knowledge bases, UCSB, Fall 2016


Page 1: Using Natural Language Processing on large Github data

- John Alexander and Harshitha Chidananda
CS273: Data and Knowledge bases

UCSB, Fall 2016

Page 2: Outline

● Github

● Knowledge base

● Data

● Goals

● Vector Space Model

● Topic Modeling

● Gitsmart Demo

● Conclusion

Page 4: Our knowledge base

The aim of the project is to use existing data in large, well-developed GitHub repositories to build a knowledge base that can assist in the continued development of the project and in more quickly resolving issues posted to the repository.

Problem:

Currently, issues and pull requests for large projects are manually curated.

Proposed solution:

By providing a queryable, largely unsupervised means of tracking input from developers and users, we can significantly improve the efficiency of project curators.

Goals:

- Use Natural Language Processing to extract meaningful data from text in the repository
- Find similar issues
- Recommend people who could work on the issues in the repository, based on their previous work
- Draw relationships among issues, pull requests, and users

Page 5: Data

Pages 6-7: Data (images)
Page 8: Vector Space Modeling

Page 9: Vector Space Model

- Also called the ‘term vector model’ or ‘vector processing model’

- Represents documents as sets of terms

- Compares global similarity between queries and documents; used in information filtering, information retrieval, indexing, and relevancy ranking
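In this model a query q and a document d are both term-weight vectors, and the similarity score (used on Page 12) is the cosine of the angle between them:

cos(q, d) = (q · d) / (‖q‖ ‖d‖)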

Page 10: 1. GitHub issues data from the GitHub API

Organization: Facebook

Repo: rocksdb

Specifically: number, title, and body
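A minimal sketch of this step in Python, assuming the requests library; the endpoint and field names are those of the GitHub REST API (unauthenticated requests are rate-limited, so a token may be needed for a large repository):

import requests

def fetch_issues(org="facebook", repo="rocksdb", pages=5):
    """Fetch issue number, title, and body from the GitHub REST API."""
    issues = []
    for page in range(1, pages + 1):
        resp = requests.get(
            f"https://api.github.com/repos/{org}/{repo}/issues",
            params={"state": "all", "per_page": 100, "page": page},
        )
        resp.raise_for_status()
        for item in resp.json():
            if "pull_request" in item:  # this endpoint also returns PRs; skip them
                continue
            issues.append({"number": item["number"],
                           "title": item["title"],
                           "body": item["body"] or ""})
    return issues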

Page 11: 2. Preprocess data

Remove stop words

Remove punctuation
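A sketch of the preprocessing, assuming NLTK's English stop-word list (any stop-word list would do; the slides do not name one):

import string
from nltk.corpus import stopwords  # requires a one-time nltk.download("stopwords")

STOP = set(stopwords.words("english"))

def preprocess(text):
    """Lowercase, strip punctuation, and drop English stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOP]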

Page 12: 3. Vector Space Model

Vectorize

Cosine similarity
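A sketch of this step with scikit-learn; the slides only say "vectorize", so the TF-IDF weighting below is an assumption (raw term counts would fit the model on Page 9 equally well). Each entry of issue_texts is an issue's title and body joined into one string:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def most_similar_issues(issue_texts, top_k=2):
    """For each issue, return the indices of the top_k most similar issues."""
    matrix = TfidfVectorizer(stop_words="english").fit_transform(issue_texts)
    sims = cosine_similarity(matrix)   # n x n matrix of pairwise similarities
    results = []
    for i in range(len(issue_texts)):
        row = sims[i].copy()
        row[i] = -1.0                  # exclude the issue itself
        results.append((i, row.argsort()[::-1][:top_k].tolist()))
    return results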

Page 13: 4. Similar issues found

(examples)

● "Support for range skips in compaction filter" ↔ "Support for EventListeners in the Java API" ↔ "Rocksdb compaction error"

● "rocksdb crashed at rocksdb::InternalStats::DumpCFStats" ↔ "RocksDB shouldn't determine at build time whether to use jemalloc / tcmalloc"

● "Option to expand range tombstones in db_bench" ↔ "allow_os_buffer option" ↔ "Collapse range deletions for faster range scans"

● "Range deletions unsupported in tailing iterator" ↔ "Collapse range deletions for faster range scans"

● "EnvPosixTestWithParam should wait for all threads to finish" ↔ "test failure: ./db_test2 - all threads in pthread_cond_wait"

● "Fix purging non-obsolete log files" ↔ "With 4.x LOG files include frequent options Dump (leads to large LOG files)"

Page 14: Topic Modeling

Page 15: Topic Modeling

Topic modeling is a frequently used text-mining tool for discovering hidden semantic structures in a body of text.

Latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan, 2003)

● The most popular form of topic modeling
● Views each document as a distribution over topics
● Views each topic as a distribution over words
● Over many iterations, the algorithm infers the most likely distributions
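A minimal sketch of LDA training with gensim, as a stand-in for whatever toolkit is at hand (the authors' own runs used Mallet, named on the next slide); each document is a token list from the preprocessing step, and the topic count of 20 is an arbitrary placeholder:

from gensim import corpora, models

def train_lda(tokenized_docs, num_topics=20):
    """Each document becomes a distribution over topics; each topic,
    a distribution over words (Blei, Ng, and Jordan, 2003)."""
    dictionary = corpora.Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = models.LdaModel(corpus, num_topics=num_topics,
                          id2word=dictionary, passes=10)
    return lda, dictionary

# lda.print_topics() lists each topic as a weighted word distribution;
# lda[dictionary.doc2bow(tokens)] gives a document's topic distribution.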

Page 16: Our Approach

● Get issues for a single GitHub repository
● Each issue and its comments form a document
● Run LDA to produce topics

○ Mallet library
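One way to drive the Mallet library from Python is gensim's wrapper (present in gensim 3.x, removed in 4.0); mallet_path below is a placeholder for a local Mallet install, and corpus and dictionary come from the previous sketch:

from gensim.models.wrappers import LdaMallet  # gensim < 4.0 only

mallet_path = "/path/to/mallet/bin/mallet"  # placeholder: local Mallet binary
lda_mallet = LdaMallet(mallet_path, corpus=corpus,
                       num_topics=20, id2word=dictionary)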

Page 17: User Matching

Assumptions:

● Users specialize in different areas within a repository

● These specializations are reflected in the topics

Page 18: User Matching

● For each user:
○ Get the comments associated with the user
○ Apply the topic model to these comments
○ Receive a topic distribution for that user
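A sketch of this loop, reusing preprocess, lda, and dictionary from the earlier sketches; concatenating each user's comments into a single document is an assumption the slides leave implicit:

def user_topic_vectors(comments_by_user, lda, dictionary, num_topics=20):
    """Infer one topic distribution per user from their concatenated comments."""
    vectors = {}
    for user, comments in comments_by_user.items():
        bow = dictionary.doc2bow(preprocess(" ".join(comments)))
        vec = [0.0] * num_topics
        # minimum_probability=0.0 makes gensim report every topic, even tiny ones
        for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            vec[topic_id] = prob
        vectors[user] = vec
    return vectors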

Page 19: User Matching

● For a query:
○ Apply the topic model to the query
○ Take the dot product of the query topic vector with each user topic vector
○ Multiply by log(#user issues)
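A sketch of the scoring rule; the +1 inside the log is an added guard against log(0) for users with no issues, not something the slides specify:

import math

def rank_users(query, lda, dictionary, user_vectors, issue_counts):
    """Score users by dot(query topics, user topics) * log(#user issues)."""
    bow = dictionary.doc2bow(preprocess(query))
    q = dict(lda.get_document_topics(bow, minimum_probability=0.0))
    scores = {}
    for user, vec in user_vectors.items():
        dot = sum(q.get(t, 0.0) * w for t, w in enumerate(vec))
        scores[user] = dot * math.log(issue_counts.get(user, 0) + 1)  # +1 guards log(0)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)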

Page 20: DEMO

Page 21: Further Development

● Improve the stop-word list
● Add stemming to reduce word ambiguity
● Improve the weighting based on the total number of issues

Page 22: Conclusion

- The GitHub API is very useful and exposes a lot of data
- There is no ground truth to compare our results against
- The demo application could be used to notify users about issues they can solve
- Issues would get solved faster
- Users would not have to search for issues they can work on
- Grouping similar issues; an issue recommender

Page 23: Thank You

Questions?

CS273: Data and Knowledge bases