stack overflow slides data analytics
![Page 1: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/1.jpg)
What StackOverflow Tells Us About Programming Languages
Rahul Thankachan, Nada Aldarrab, Prathmesh Gat
University of Southern California
![Page 2: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/2.jpg)
Agenda
![Page 3: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/3.jpg)
Agenda
1. Introduction
2. Problem Statement
3. Temporal Based Trend Analysis
4. Topic Analysis
5. Predicting Time To Answer
6. Summary and Q&A
![Page 4: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/4.jpg)
Introduction
![Page 5: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/5.jpg)
Introduction
• Dataset used: Stack Overflow (Internet Archive).
• Why? It is currently one of the largest developer-focused open collaborative platforms.
• Through our study we intend to answer some interesting questions
• Study the rise and fall of popular programming languages
• Can be used to predict future enhancements
• Study the effectiveness of the Stack Overflow model
![Page 6: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/6.jpg)
Problem Definition
• Basic analysis: What are the most popular programming languages?
• What are the trends in programming languages?
• What are the most popular topics discussed in a programming language?
• Can we accurately predict the time it takes until a questioner gets an answer?
![Page 7: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/7.jpg)
Related Work
Miltiadis Allamanis and Charles Sutton. 2013. Why, When and What: Analyzing Stack Overflow Questions by Topic, Type & Code. In 10th Working Conference on Mining Software Repositories, Mining Challenge. IEEE, pages 53-56.
• Topic modeling analysis
• Used Latent Dirichlet Allocation (LDA)
• Modeled Java topics of questions
• Can evaluate the orthogonality of different languages
• Stack Overflow questions are about the code and are not application-domain specific
![Page 8: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/8.jpg)
Related Work
V. Bhat, A. Gokhale, R. Jadhav, J. Pudipeddi, and L. Akoglu. Min(e)d your tags: Analysis of question response time in StackOverflow. In Proceedings of ASONAM 2014, pages 328–335. IEEE, 2014.
• Two linear classifiers: logistic regression and SVM with a linear kernel
• Two non-linear classifiers: decision tree (DT) and SVM with a radial basis function kernel
![Page 9: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/9.jpg)
Related Work
Prediction Accuracy:
![Page 10: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/10.jpg)
Basic Analysis & Temporal Trends
![Page 11: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/11.jpg)
Stack Overflow Activity
Approximately 45% of all programming languages in the world are discussed on Stack Overflow!
![Page 12: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/12.jpg)
Post Type
![Page 13: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/13.jpg)
Top Ten Languages
![Page 14: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/14.jpg)
Question Count
![Page 15: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/15.jpg)
Answer Count
![Page 16: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/16.jpg)
Answer Fraction
![Page 17: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/17.jpg)
Question Fraction
![Page 18: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/18.jpg)
Questioners/Answerers Distribution
![Page 19: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/19.jpg)
Topic Analysis
![Page 20: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/20.jpg)
Topic Analysis
![Page 21: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/21.jpg)
Predicting time until Answer
![Page 22: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/22.jpg)
Approach
The following attributes were selected for study:
1. Tag (only the top 10 programming languages)
2. Creation Month
3. Body Length
4. Tag Length
5. Introduced a new nominal class, Time_Answer (less6, bet6and20, 20andmore)
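A minimal Python sketch of how one question could be turned into the attributes listed above. The dictionary keys, the `top_tags` parameter, and the reading of "tag length" as the number of tags on the question are assumptions for illustration; the slides do not specify them. The class thresholds are read as minutes, with exactly 6 and exactly 20 falling into the upper bucket, which is also an assumption.

```python
from datetime import datetime


def time_answer_class(delta_minutes):
    """Map minutes until the first answer to the nominal Time_Answer class."""
    if delta_minutes < 6:
        return "less6"
    if delta_minutes < 20:
        return "bet6and20"
    return "20andmore"


def extract_features(question, delta_minutes, top_tags):
    """Build one feature row, or return None if no top-language tag is present.

    question: dict with hypothetical keys 'tags' (list of str),
              'creation_date' (datetime) and 'body' (str).
    top_tags: set of language tags to keep (the top ten in the study).
    """
    language = next((t for t in question["tags"] if t in top_tags), None)
    if language is None:
        return None
    return {
        "tag": language,
        "creation_month": question["creation_date"].month,
        "body_length": len(question["body"]),    # characters (assumption)
        "tag_length": len(question["tags"]),     # number of tags (assumption)
        "time_answer": time_answer_class(delta_minutes),
    }
```

For example, a question tagged `python` and `pandas`, asked in March and answered after 4.5 minutes, would yield `tag="python"`, `creation_month=3`, `tag_length=2` and `time_answer="less6"`.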
![Page 23: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/23.jpg)
Approach
Tools
Weka: a collection of machine learning algorithms for data mining tasks.
Data Preprocessing: Challenging!
![Page 24: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/24.jpg)
Approach (Data Preprocessing)
1. Parse all the answers and link each question's creation time to the creation time of its first answer. We call the difference delta-answer.
2. Remove all questions whose delta-answer is negative or zero.
3. We developed a Python script that generates the .arff file on the fly (we wish to contribute this script).
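The preprocessing steps above could be sketched in Python roughly as follows. This is not the authors' script: the function names and input shapes are hypothetical, and delta-answer is assumed to be measured in minutes. The output follows the standard ARFF layout (`@relation`, `@attribute`, `@data`) that Weka reads.

```python
from datetime import datetime


def delta_answer_minutes(questions, answers):
    """Return {question_id: minutes until the first answer}.

    questions: dict mapping question id -> creation datetime
    answers:   iterable of (parent_question_id, creation datetime)
    Questions with a negative or zero delta are dropped (step 2).
    """
    first_answer = {}
    for parent_id, created in answers:
        if parent_id not in questions:
            continue  # answer to a question outside the data set
        if parent_id not in first_answer or created < first_answer[parent_id]:
            first_answer[parent_id] = created
    deltas = {}
    for qid, answered in first_answer.items():
        minutes = (answered - questions[qid]).total_seconds() / 60
        if minutes > 0:
            deltas[qid] = minutes
    return deltas


def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes: list of (name, kind); kind is 'numeric' or a list of
                nominal values.
    rows:       iterable of value tuples in attribute order.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        spec = kind if isinstance(kind, str) else "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"
```

Keeping the ARFF rendering separate from the delta computation lets the same writer be reused for different feature combinations.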
![Page 25: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/25.jpg)
Evaluation
• Full data size: 4,490,947; subset used: 449,000
• Classify response time into 3 types: less than 6 minutes, between 6 and 20 minutes, 20 minutes and more.
• 10-fold cross-validation
• Results are obtained using different feature combinations and different classifiers
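To illustrate what 10-fold cross-validation does, here is a small Python sketch. It is a stand-in, not the Weka setup used in the study: it splits the data into ten folds and scores a trivial majority-class baseline rather than a real classifier. The function names and the fixed seed are assumptions for illustration.

```python
import random
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate_majority(labels, k=10, seed=0):
    """Mean accuracy of a majority-class baseline over k folds.

    Each fold is held out once; the baseline predicts the most common
    label in the remaining folds. Assumes len(labels) >= k.
    """
    folds = k_fold_indices(len(labels), k, seed)
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train = [labels[i] for i in range(len(labels)) if i not in held_out]
        majority = Counter(train).most_common(1)[0][0]
        correct = sum(labels[i] == majority for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```

A baseline like this is a useful reference point: a classifier trained on the real attributes is only informative to the extent that it beats the share of the most frequent Time_Answer class.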
![Page 26: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/26.jpg)
Evaluation
Results of the J48 classifier (all attributes)
![Page 27: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/27.jpg)
Evaluation
Results of classifier (body_length / tag_length)
![Page 28: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/28.jpg)
Summary
• We found interesting temporal trends for major programming languages
• Using tag-based topic analysis, we were able to find the major discussion topics and, to some extent, the difficult topics in a programming language
• Using machine learning techniques, we were able to predict time to answer with good accuracy
![Page 29: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/29.jpg)
Future Scope of Work
• Contribute the on-the-fly .arff generator script.
• Add parts of speech as an attribute
• Showcase the results on a website
![Page 30: Stack Overflow slides Data Analytics](https://reader031.vdocuments.mx/reader031/viewer/2022021919/587a26a01a28abbd388b525f/html5/thumbnails/30.jpg)
Questions?