p.d.a.t. (piazza data analysis tool)cse400/cse400_2015_2016/... · 2016-05-10 · data through our...

P.D.A.T. (Piazza Data Analysis Tool) Aashish Lalani, Varun Agarwal

Advisors: Swapneel Sheth, Arvind Bhusnurmath, Benedict Brown

ABSTRACT

Piazza is a platform used by university courses where students can ask questions and get answers. Our project created generalizable parsing tools that extract class-relevant data and format it into statistically usable datasets. It also provides a suite of statistical tools with which this data can be analyzed. The analysis uses several distributions, including Poisson, Beta-Geometric, and Negative Binomial. We modeled behaviors observed in the extracted data: for example, the answering patterns of teaching assistants versus students and of newly hired versus returning teaching assistants, the effect of question length on answer time, redundant questions by specific students, the small number of students who answer most of the questions, and the spike in questions before milestones. We provide relevant information and tools to our project advisors for the betterment of their courses, in particular Introduction to Computer Programming. Results show that recruiting practices may be improved by using models trained on past semesters to model TA involvement in later semesters.

I. INTRODUCTION

As computer science becomes a more popular choice among college students, class sizes are growing. Between 2012 and 2014 the number of students in Introductory Computer Science at Penn increased from 400 to 600. The number of TAs has been increasing in proportion as well. Such large classes generate a lot of data in the form of class performance, attendance, and participation on Piazza. Piazza is a platform used by university courses where students can ask questions and get answers. Our objective was to study this data and find evidence for or against certain hypotheses, with the goal of improving the class experience for both teachers and students. Piazza had been the subject of some previous research with a select set of hypotheses, such as “Female students are more likely to answer questions anonymously than their male peers”. However, the variety of hypotheses was limited [5][6][7]. The contribution of our research is twofold: first, an app that can be used in the future to parse data dumps, and second, statistical analysis of the datasets we create to show evidence for or against certain hypotheses. For example, we studied the spikes in Piazza activity before milestones, the effect of post length on answering patterns, the concentration of answers coming from a small group of students or TAs, and the difference in answering patterns between old and new TAs.

In order to conduct our research we required Piazza data, ideally from a class that is large in terms of both number of students and number of posts. Hence we chose CIS 110, the introductory computer science class at Penn, because it met both requirements. We obtained the raw CIS 110 Piazza data from Piazza through our advisors. Most of the results in this paper are based on this data.

II. APPROACH

A. Pipeline

1. Acquire data from Piazza in XML format. Two files should be provided for each class forum: class_content.xml and users.xml. The class_content.xml file contains every piece of activity on the forum, including questions, answers, follow-ups, timestamps for every post, user interaction for every post, and more. The users.xml file contains information about each user subscribed to the class forum, such as the number of questions asked, answered, and viewed, and more. An example snapshot of the two files follows:

class_content.xml

users.xml

2. The tool can now be started from the command line and will offer a variety of hypotheses. Each requires one or both of the files listed above and either produces datasets in CSV format or immediately produces graphs for analysis, depending on whether the chosen hypothesis requires more advanced statistics. The tool is implemented as a bash script, and the data parsing commands are implemented in Python. Data parsing is mainly done using the ElementTree library, which provides XML parsing and traversal tools [1]. Snapshots of the command line and one of the Python scripts follow:

Command Line Interface

Python Script

3. Data was provided in both XML and CSV formats, so the parsers had to be equipped for both. fileCSVReader.java expects two files as arguments: the first is users.xml, which provides a data dump of all users in the participating class, and the second is a CSV file, which provides further details about the users.

To parse the XML file we used javax.xml.parsers.DocumentBuilder, and to read the CSV file we used java.util.Scanner [8][9]. While the XML file provides data such as the number of posts, answers, questions asked, questions answered, and days online, the CSV file indicates whether the individual is a student, TA, or professor. The two files are parsed and then merged, using name as a common key, to produce a file that contains all the data for future use. For certain hypotheses we did not take this approach. Instead, the Java file dealing with the hypothesis would parse the file individually, searching for keywords or symbols relevant to the hypothesis.

4. If your hypothesis requires more advanced statistical analysis, the parser will have produced a CSV file containing the necessary dataset. You are now free to use your own software to analyze it or to continue with our programmable Excel sheets. Simply open the programmable sheet and replace the current data with the data from the CSV file that has just been generated. You must then run the Solver tool to optimize the models for the input data. This requires installing the Data Analysis ToolPak by Microsoft. Once you run the Solver tool (all parameters are preset on the sheet), the charts and graphs will update automatically. A snapshot of the example CSV and the programmable Excel sheet follows:

CSV file

Programmable excel sheet
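The parse-and-merge step described in item 3 can be sketched in Python as follows. This is only an illustration: the project performed this step in Java, and the element names (user, name, answers, asks), the CSV columns (name, role), and the sample records are all hypothetical, since the real Piazza export schema is not reproduced here.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for users.xml and the roles CSV (made-up schema and data).
SAMPLE_USERS_XML = """<users>
  <user><name>Alice Example</name><answers>12</answers><asks>3</asks></user>
  <user><name>Bob Example</name><answers>0</answers><asks>7</asks></user>
</users>"""

SAMPLE_ROLES_CSV = "name,role\nAlice Example,TA\nBob Example,student\n"


def parse_users(xml_text):
    """Return {name: {"answers": int, "asks": int}} from the XML text."""
    root = ET.fromstring(xml_text)
    users = {}
    for user in root.iter("user"):
        name = user.findtext("name")
        users[name] = {
            "answers": int(user.findtext("answers", default="0")),
            "asks": int(user.findtext("asks", default="0")),
        }
    return users


def merge_roles(users, csv_text):
    """Attach each user's role from the CSV, using name as the common key."""
    merged = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        stats = users.get(row["name"])
        if stats is not None:  # skip CSV rows with no matching XML record
            merged[row["name"]] = dict(stats, role=row["role"])
    return merged


merged = merge_roles(parse_users(SAMPLE_USERS_XML), SAMPLE_ROLES_CSV)
for name, record in merged.items():
    print(name, record)
```

Joining on name is fragile when two users share a name or when spelling differs between the files, so a stable identifier would be a safer key if the export offers one.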

B. Example of Hypotheses

i. Returning TAs answer fewer questions than new TAs

ii. Number of questions spikes before class milestones

iii. A subset of students answer most questions

iv. A subset of students answer redundant questions

v. Length of question influences response time

vi. Inclusion of code snippets influences response time

vii. Higher attendance in office hours correlates with higher grades

viii. The same questions are asked each semester

ix. Participation is influenced by grade year

x. Women tend to post anonymously more often than men

xi. Students who answer many questions do well in class

C. Statistical Approach

i. Returning TAs answer fewer questions than new TAs

To analyze the answering behavior of new and returning TAs, the answering data for those subsets must be parsed out. Once they are discretized by the parsing tool, programmable Excel sheets are used to apply models to each of the datasets and compare them. The Gamma-Poisson model was used here for several reasons tied to the context of the data: the number of answers for any given person is discrete, it is bounded by 0 and some reasonable upper bound, and the propensity to answer questions is very likely to differ from person to person. The key is to take these different propensities into account, which a Gamma-Poisson mixture does very nicely [2]. The models for each group can then be compared using chi-square [2], the LRT (Likelihood Ratio Test) [2], or another test of the user's choice.

The Gamma-Poisson distribution
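As a rough illustration of the Gamma-Poisson (negative binomial) model, the sketch below evaluates its probability mass function with shape r and rate alpha and fits the two parameters by the method of moments. This is an assumption-laden stand-in: the project optimized its models with the Excel Solver tool, the method-of-moments fit is a simpler substitute, and the answer counts are made up.

```python
import math
from statistics import mean, pvariance


def nbd_pmf(x, r, alpha):
    """P(X = x) when X ~ Poisson(lam) and lam ~ Gamma(shape=r, rate=alpha)."""
    log_p = (
        math.lgamma(r + x) - math.lgamma(r) - math.lgamma(x + 1)
        + r * math.log(alpha / (alpha + 1.0))
        - x * math.log(alpha + 1.0)
    )
    return math.exp(log_p)


def fit_by_moments(counts):
    """Match the sample mean and variance to r/alpha and r/alpha + r/alpha**2."""
    m = mean(counts)
    v = pvariance(counts)
    if v <= m:
        raise ValueError("data is not overdispersed; a plain Poisson may suffice")
    alpha = m / (v - m)
    r = m * alpha
    return r, alpha


def log_likelihood(counts, r, alpha):
    """Log-likelihood of the counts under the fitted model."""
    return sum(math.log(nbd_pmf(x, r, alpha)) for x in counts)


# Made-up answer counts for eight TAs (not project data).
example_counts = [0, 0, 0, 0, 4, 4, 4, 4]
r_hat, alpha_hat = fit_by_moments(example_counts)
print(r_hat, alpha_hat, log_likelihood(example_counts, r_hat, alpha_hat))
```

Under this parameterization the mean number of answers is r/alpha, so fitting the new and returning TA groups separately gives two (r, alpha) pairs whose log-likelihoods can feed a likelihood ratio test.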

ii. Number of questions spikes before class milestones

The idea here is simply to count the number of questions posted within certain timeframes and mark the relevant milestone dates. The data was parsed by examining the timestamps of questions and counting each into a date bin, with one bin for every day in the semester. The data was then plotted using matplotlib [3].
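The daily binning described above could look like the following minimal sketch. The timestamp format (ISO 8601 with a trailing Z), the function name, and the sample timestamps are assumptions for illustration, not the format of the actual Piazza export.

```python
from collections import Counter
from datetime import date, datetime, timedelta

# Assumed timestamp layout; adjust to whatever the export actually uses.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def questions_per_day(timestamps, start, end):
    """Return [(day, count)] for every day from start to end inclusive."""
    counts = Counter(
        datetime.strptime(ts, TIMESTAMP_FORMAT).date() for ts in timestamps
    )
    n_days = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(n_days)]
    # Days with no questions still get a bin, with a count of zero.
    return [(day, counts.get(day, 0)) for day in days]


# Made-up question timestamps.
sample = ["2016-02-01T10:00:00Z", "2016-02-01T23:59:59Z", "2016-02-03T08:30:00Z"]
daily = questions_per_day(sample, date(2016, 2, 1), date(2016, 2, 3))
for day, count in daily:
    print(day.isoformat(), count)
```

The resulting list of (day, count) pairs can be passed directly to a plotting library, with milestone dates drawn as vertical markers.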

iii. A small group of students answer most questions

We studied the stark inequality in the number of questions answered by each user. The first step was simply to go through the created database and build a hash map of key-value pairs of users and number of answers. We then calculated the mean, standard deviation, etc. We found that the top 20 class participants were answering almost 80% of the questions. This inspired us to study different aspects of this phenomenon with several datasets and different models. The first approach was to plot all the members of the class on a Lorenz curve. We first sorted the data from the lowest number of answers to the highest. Then we calculated the total number of answers and divided the sorted users into quintiles (five groups). We calculated the total number of answers coming from each quintile, the percentage of all answers from each quintile, and the cumulative percentages. Graphing this information alongside a line of equality, one can compare the ideal state, where each user answers an equal number of questions, with the real state, where a small fraction of people give a large fraction of the answers. The Gini index is a measure of inequality. To calculate it, we computed the area under the Lorenz curve and the area under the line of equality, then divided the difference between the two areas by the area under the line of equality.
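The quintile-based Lorenz curve and Gini index computation can be sketched as below. It assumes the area under the Lorenz curve is approximated with trapezoids between quintile points; the function names and the sample answer counts are illustrative only.

```python
def lorenz_points(answer_counts, groups=5):
    """Cumulative (share of users, share of answers) at each group boundary."""
    values = sorted(answer_counts)  # lowest number of answers first
    total = sum(values)
    n = len(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one user and one answer")
    points = [(0.0, 0.0)]
    for g in range(1, groups + 1):
        cut = (g * n) // groups  # index closing the g-th group
        points.append((cut / n, sum(values[:cut]) / total))
    return points


def gini_index(points):
    """(area under equality line - area under Lorenz curve) / area under equality line."""
    lorenz_area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        lorenz_area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid rule
    equality_area = 0.5
    return (equality_area - lorenz_area) / equality_area


# Made-up data: one user gives every answer, versus everyone answering equally.
unequal = gini_index(lorenz_points([0, 0, 0, 0, 10]))
equal = gini_index(lorenz_points([2, 2, 2, 2, 2]))
print(unequal, equal)
```

With only five points the curve is coarse, so the index will generally be somewhat lower than one computed from every individual user.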

iv. A subset of students answer redundant questions

We conducted a similar study on a smaller dataset in which we only looked at answers to questions that were repeats. We parsed the data searching for the “@” symbol, which indicates that the answer points to another question. We then found the corresponding uid and created a histogram to study the answering patterns for redundant questions.
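A minimal sketch of this search follows. It assumes a cross-reference appears as "@" followed by a post number, which is stricter than searching for a bare "@" and avoids counting e-mail addresses; the (uid, text) input shape and the sample answers are hypothetical.

```python
import re
from collections import Counter

# Assumed form of a cross-reference to another post, e.g. "@123".
POST_REFERENCE = re.compile(r"@\d+")


def redundant_answer_counts(answers):
    """Count, per uid, the answers that point to another post."""
    counts = Counter()
    for uid, text in answers:
        if POST_REFERENCE.search(text):
            counts[uid] += 1
    return counts


def star_histogram(counts):
    """One line per uid, with one * per answer to a repeat question."""
    return [f"{uid}: {'*' * n}" for uid, n in counts.most_common()]


# Made-up answers as (uid, answer text) pairs.
sample_answers = [
    ("u1", "See @12 for the same issue."),
    ("u2", "Try recompiling first."),
    ("u1", "Duplicate of @7."),
    ("u3", "Email me at someone@example.com"),
]
counts = redundant_answer_counts(sample_answers)
print("\n".join(star_histogram(counts)))
```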

v. Length of question influences response time

This was accomplished by looking at each question’s length, time of posting, and time of first response. The time it took to respond to a question determined which bin it was placed in. The bins cover increasing intervals of time and are cumulative, so as to show when close to 100% of questions under or over a certain length have been answered.
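The cumulative binning can be illustrated with the sketch below. The bin edges, the length threshold, and the sample questions are all assumptions chosen for the example; the paper does not state the values the tool uses.

```python
# Assumed bin edges in hours, with increasing interval widths.
BIN_EDGES_HOURS = [1, 6, 24, 48]


def split_by_length(questions, threshold):
    """Split (length, response_hours) pairs into short and long questions."""
    short = [hours for length, hours in questions if length <= threshold]
    long_ = [hours for length, hours in questions if length > threshold]
    return short, long_


def cumulative_share(response_hours, edges=BIN_EDGES_HOURS):
    """Fraction of questions first answered within each edge (cumulative)."""
    n = len(response_hours)
    if n == 0:
        return [0.0 for _ in edges]
    return [sum(1 for h in response_hours if h <= e) / n for e in edges]


# Made-up questions: (length in characters, hours until first response).
sample_questions = [(100, 0.5), (120, 2.0), (900, 5.0), (1500, 30.0)]
short, long_ = split_by_length(sample_questions, threshold=500)
short_share = cumulative_share(short)
long_share = cumulative_share(long_)
print(short_share, long_share)
```

Comparing the two cumulative lists edge by edge shows how much later the longer questions reach full coverage.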

III. RESULTS AND MEASUREMENTS

i. Returning TAs answer fewer questions than new TAs

First, the models were created and graphed next to the actual values to see how well the models fit. The graphs for new hires and returning TAs follow:

A chart of the model parameters and chisq values that test how well the models fit follows:

In the chart, r conveys the level of homogeneity, alpha conveys the level of magnitude, chi-sq measures the fit to the actual data, and p-val shows the significance of the fit (high is good).

The two models overlaid follow:

ii. Number of questions spikes before class milestones

The matplotlib graph follows, with annotations for certain class milestones:

iii. A subset of students answer most questions

We found that the top 20 class participants were answering almost 80% of the questions. The Gini index was found to be 0.287892. The variance was 523 and the standard deviation was 22 [4].

iv. A subset of students answer redundant questions

Spikes in participation for repeat questions were found.

Histogram where each * represents an answer to a repeat question

v. Length of question influences response time

A graph detailing the number of questions answered within an interval of time from posting follows:

IV. ETHICAL AND PRIVACY CONSIDERATIONS

Obtaining data is an obstacle to performing research. Class attendance, performance, and student data raise privacy concerns. IRB approval is required to obtain and present research data. Researchers must store the data securely, since it could be released publicly by mistake or intentionally. Data transfer between researchers is another possible source of leaks, and precautions must be taken. In order to use and present the more sensitive data that we planned to work with, such as grading data and office hours attendance data, we were required to acquire IRB (Institutional Review Board) approval. Before we could acquire such approval, we, along with our advisors, had to go through a mandatory online training session where we were informed about many privacy concerns. We had to file the HS-ERA (Human Subjects-Electronic Research Application) and submit it to the IRB before moving forward with those steps. All such data was also sanitized by the professors who provided it so that anonymity was preserved.

V. DISCUSSION

i. Returning TAs answer fewer questions than new TAs

When we separate the TAs into new hires and returning TAs, we see that new hires have a mean closer to the middle of the distribution, while returning TAs spike at 0. This implies that new hires tended to answer more questions on average even if they did not necessarily answer the most questions (that is, new hires showed a peak somewhere closer to the middle of the bins). There may therefore be some benefit in engaging returning TAs differently, possibly by giving them more advanced tasks on Piazza or even moving them to higher-level courses when their rate of Piazza interaction goes down.

ii. Number of questions spikes before class milestones

It is easy to see that class milestones such as homework due dates and the midterm exam date correlate with spikes in the number of questions posted in the days prior. While this does not establish causation, it still suggests that our current way of assigning Piazza duty to TAs is inefficient. Currently, CIS 110 TAs are assigned by week with no regard for how busy Piazza is expected to be that week, which leads to some TAs answering significantly more questions than others. A better way might be to assign TAs by day or half-week so that more TAs can be assigned to the days that are predicted to be much busier.

iii. Few people answer most questions

As can be seen from the graph, a very small group of people answered a large percentage of the questions. One suggestion to improve this is to encourage students to participate by giving points for Piazza participation. This could be controversial, so bonus points may be a better option: for example, a certain number of answers could earn a certain number of bonus points.

iv. Redundant questions

We can see that just a small group of people repeatedly direct students to old answers. While participation must again be increased, the Piazza interface could perhaps also be improved to make it easier for students to find old posts.

v. Length of question influences response time

From the data it is easy to see that the majority of questions are answered within 6 hours of posting. While this gives students a better idea of how early they should post their questions, it does not necessarily apply to certain types of posts (very long posts, posts with code snippets, etc.). It is possible to add these and other parameters to the parsing tool to determine their effect on question response time.

VI. LIMITATIONS AND FUTURE WORK

We could not obtain all data sources, such as class performance data. That data, combined with Piazza data, could lead to interesting conclusions. Working on hypotheses that incorporate many different types of data sources would be revealing. The app could also be used to study other classes that use Piazza, as a way of studying and improving the class experience for both students and teachers. Working with the IRB to get our more sensitive data approved for use was very troublesome. Although we were in contact with many people from the IRB, we rarely received responses about the status of our applications. Although we all completed our training and were prepared to submit our forms, the lack of time kept us from being able to effectively present grading data in our statistics. However, this does present many opportunities for future work.

REFERENCES

[1] 19.7. xml.etree.ElementTree — The ElementTree XML API. (n.d.). Retrieved April 25, 2016, from https://docs.python.org/2/library/xml.etree.elementtree.html Python Software Foundation.

[2] Bradlow, E., Fader, P., Adrian, M., & Mcshane, B. B. (n.d.). Count Models Based on Weibull Interarrival Times. SSRN Electronic Journal. doi:10.2139/ssrn.729886.

[3] S. (n.d.). Matplotlib. Retrieved April 25, 2016, from http://matplotlib.org/.

[4] Bourne, M. (2010, February 24). The Gini Coefficient of wealth distribution. Retrieved April 25, 2016, from http://www.intmath.com/blog/mathematics/the-gini-coefficient-of-wealth-distribution-4187

[5] Offit, E. (2015, May 22). Change on the horizon for Computer and Information Science department. Retrieved April 25, 2016, from http://www.thedp.com/article/2015/05/growth-of-the-computer-science-department

[6] Offit, E. (2014, October 30). For computer science classes, growing enrollment but the same accommodations. Retrieved April 25, 2016, from http://www.thedp.com/article/2014/10/computer-science-enrollment

[7] STEM Confidence Gap. (2015, January 6). Retrieved April 25, 2016, from http://blog.piazza.com/stem-confidence-gap/

[8] DocumentBuilder (Java Platform SE 7). (n.d.). Retrieved April 25, 2016, from https://docs.oracle.com/javase/7/docs/api/javax/xml/parsers/DocumentBuilder.html

[9] Scanner (Java Platform SE 7). (n.d.). Retrieved April 25, 2016, from https://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html
