Using Natural Language Processing on Large GitHub Data
John Alexander and Harshitha Chidananda
CS273: Data and Knowledge bases
UCSB, Fall 2016
Outline
● GitHub
● Knowledge base
● Data
● Goals
● Vector Space Model
● Topic Modeling
● Gitsmart Demo
● Conclusion
GitHub
GitHub is a web-based Git repository hosting service.
It offers:
- distributed version control
- the source code management (SCM) functionality of Git
It provides:
- access control
- bug tracking
- feature requests
- task management
Our knowledge base
The aim of the project is to use existing data in large, well-developed GitHub repositories to build a knowledge base that can assist in the continued development of the project and help resolve issues posted to the repository more quickly.
Problem:
Currently, issues and pull requests for large projects are manually curated.
Proposed solution:
By providing a queryable, largely unsupervised means of tracking input from developers and users, we can significantly improve the efficiency of project curators.
Goals:
- Use natural language processing to get meaningful data from text in the repository
- Find similar issues
- Recommend people who could work on issues in the repository based on their previous work
- Draw relationships between issues, pull requests, and users
Data
Vector Space Modeling
Vector Space Model
- Also called the ‘term vector model’ or ‘vector processing model’
- Represents documents by term sets
- Compares global similarities between queries and documents; used in information filtering, information retrieval, indexing, and relevancy rankings
1. GitHub issues data from the GitHub API
Organization: Facebook
Repo: rocksdb
Specifically: number, title, and body
2. Preprocess data
Remove stop words
Remove punctuation
3. Vector Space Model
Vectorize
Cosine similarity
4. Similar issues found
(examples; each group lists issue titles that were matched as similar)
- "Support for range skips in compaction filter" / "Support for EventListeners in the Java API" / "Rocksdb compaction error"
- "rocksdb crashed at rocksdb::InternalStats::DumpCFStats" / "RocksDB shouldn't determine at build time whether to use jemalloc / tcmalloc"
- "Option to expand range tombstones in db_bench" / "allow_os_buffer option" / "Collapse range deletions for faster range scans"
- "Range deletions unsupported in tailing iterator" / "Collapse range deletions for faster range scans"
- "EnvPosixTestWithParam should wait for all threads to finish" / "test failure: ./db_test2 - all threads in pthread_cond_wait"
- "Fix purging non-obsolete log files" / "With 4.x LOG files include frequent options Dump (leads to large LOG files)"
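Steps 2 to 4 of the pipeline can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the slides do not say which stop-word list or term weighting was used, so the tiny STOP_WORDS set and the raw term counts below are assumptions, and the issue numbers are made up for the example (the titles come from the examples above).

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the real list is not given).
STOP_WORDS = {"a", "an", "and", "at", "for", "in", "is", "of", "the", "to", "with"}

def preprocess(text):
    """Lowercase the text, drop punctuation, and remove stop words."""
    tokens = re.findall(r"[a-z0-9_]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def vectorize(text):
    """Represent a document as a sparse vector of raw term counts."""
    return Counter(preprocess(text))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(query, issues, top_n=3):
    """Rank issues (number -> text) by similarity to the query, best first."""
    q = vectorize(query)
    scored = [(cosine_similarity(q, vectorize(text)), number)
              for number, text in issues.items()]
    scored.sort(reverse=True)
    return scored[:top_n]

# Hypothetical issue numbers; titles taken from the examples above.
issues = {
    1: "Range deletions unsupported in tailing iterator",
    2: "Collapse range deletions for faster range scans",
    3: "Fix purging non-obsolete log files",
}
ranked = most_similar("range deletions in iterator", issues)
```

Here `ranked` is a list of (score, issue number) pairs. In a fuller version the text for each issue would be the title plus the body fetched from the API, and a weighting such as TF-IDF could replace the raw counts.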
Topic Modeling
Topic Modeling
Topic modeling is a frequently used text-mining tool for discovering hidden semantic structures in a body of text.
Latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan, 2003)
● Most popular form of topic modeling
● Views each document as a distribution over topics
● Views each topic as a distribution over words
● Over many iterations, the algorithm infers the most likely distributions
Our Approach
● Get issues for a single GitHub repository
● Each issue and its comments form one document
● Run LDA to produce topics
○ Mallet library
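The project ran LDA through the Mallet library. To show what "infers the most likely distributions over many iterations" means in practice, here is a toy collapsed Gibbs sampler in pure Python. It is a stand-in for illustration only, not Mallet's implementation, and the function name lda_gibbs, the hyperparameters alpha and beta, and the sample documents are all assumptions.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs is a list of token lists. Returns (doc_topic, topic_word, vocab):
    doc_topic[d] is document d's distribution over topics and
    topic_word[k] is topic k's distribution over the vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic_counts = [[0] * num_topics for _ in docs]
    topic_word_counts = [[0] * vocab_size for _ in range(num_topics)]
    topic_totals = [0] * num_topics
    assignments = []

    # Start with a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic_counts[d][k] += 1
            topic_word_counts[k][word_id[w]] += 1
            topic_totals[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic_counts[d][k] -= 1
                topic_word_counts[k][v] -= 1
                topic_totals[k] -= 1
                # Weight of each topic given all other assignments.
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word_counts[t][v] + beta)
                    / (topic_totals[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word_counts[k][v] += 1
                topic_totals[k] += 1

    # Turn the final counts into smoothed probability distributions.
    doc_topic = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic_counts)
    ]
    topic_word = [
        [(c + beta) / (topic_totals[k] + vocab_size * beta)
         for c in topic_word_counts[k]]
        for k in range(num_topics)
    ]
    return doc_topic, topic_word, vocab

# Hypothetical documents: each would be one issue plus its comments, tokenized.
docs = [
    ["range", "deletions", "tailing", "iterator"],
    ["collapse", "range", "deletions", "range", "scans"],
    ["purging", "log", "files"],
    ["log", "files", "options", "dump", "large", "log", "files"],
]
doc_topic, topic_word, vocab = lda_gibbs(docs, num_topics=2)
```

Each row of `doc_topic` and `topic_word` sums to 1. On a real repository the documents are far longer and a tuned library such as Mallet is the practical choice.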
User Matching
Assumptions:
● Users specialize in different areas within a repository
● These specializations are reflected in the topics
User Matching
● For each user:
○ Get the comments associated with the user
○ Apply the topic model to these comments
○ Receive a topic distribution for that user
User Matching
● For a query:
○ Apply the topic model to the query
○ Take the dot product of the query topic vector with each user topic vector
○ Multiply by log(#user issues)
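The scoring rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the slides do not give the base of the logarithm, so the natural log is used here, and the user names, topic vectors, and issue counts are invented for the example.

```python
import math

def user_score(query_topics, user_topics, num_user_issues):
    """Dot product of the two topic vectors, scaled by log(#user issues)."""
    dot = sum(q * u for q, u in zip(query_topics, user_topics))
    # Natural log is an assumption; the slides only say log(#user issues).
    return dot * math.log(num_user_issues)

def recommend(query_topics, users, top_n=3):
    """Rank users (name -> (topic vector, issue count)) for a query, best first."""
    scored = [
        (user_score(query_topics, topics, count), name)
        for name, (topics, count) in users.items()
        if count > 0  # log is undefined for users with no issues
    ]
    scored.sort(reverse=True)
    return scored[:top_n]

# Hypothetical users and a hypothetical 3-topic model.
users = {
    "alice": ([0.8, 0.1, 0.1], 40),
    "bob": ([0.1, 0.8, 0.1], 40),
    "carol": ([0.9, 0.05, 0.05], 1),
}
query_topics = [0.7, 0.2, 0.1]
ranking = recommend(query_topics, users)
```

Note that a user with a single issue scores zero no matter how well their topics match, because log(1) = 0. That behavior follows directly from the rule as stated.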
DEMO
Further Development
● Improve stopwords
● Add stemming to remove word ambiguity
● Improve weighting based on total number of issues
Conclusion
- The GitHub API is very useful and exposes a lot of useful data
- There is no ground truth to compare results against
- The application from the demo could be used to notify users about issues they can solve
- Issues will get solved faster
- Users don’t have to search for issues they can work on
- Grouping similar issues, issue recommender
Thank You
Questions?