MOOC Visual Analytics: Empowering Students, Teachers, Researchers, and Platform Developers of Massively Open Online Courses

Scott R. Emmons, Robert P. Light, Katy Börner

Abstract

Massively open online courses (MOOCs) offer instructors the opportunity to reach students in orders of magnitude greater than they could in traditional classroom settings, while offering students access to free or inexpensive courses taught by world-class educators. However, MOOCs pose major challenges to teachers (keeping track of thousands of students and supporting their learning progress), students (keeping track of course materials and effectively interacting with teachers and fellow students), researchers (understanding how students interact with materials and each other), and MOOC platform developers (supporting effective course design and delivery in a scalable way). Along with these challenges, the sheer volume of data available from MOOCs provides unprecedented opportunities to study how learning takes place in online courses. This paper explores the use of data analysis and visualization as a means to empower teachers, students, researchers, and platform developers by making large volumes of data easy to understand. First, we introduce the insight needs of these four user groups. Second, we review existing MOOC visual analytics studies.
Third, we present a framework for MOOC data types and data analyses to support different types of insight needs. Fourth, we present exemplary data visualizations that make data accessible and empower teachers, students, developers, and researchers with novel insights. The outlook discusses future MOOC opportunities and challenges.

Analysis Types vs. User Needs

1. Statistics: Line graphs, correlation graphs, and box-and-whisker plots are all examples of how statistical data can be rendered visually.

2. Temporal: Temporal analyses and visualizations tell when students are active over the span of a course. Data might be examined at different levels of aggregation: by minute, hour, day, week, or semester; by course modules; or before and after a midterm or final.

3. Geospatial: Geospatial data might be examined at different levels of aggregation: by address, city, country, or IP address.

4. Topical: Topical analysis provides an answer to the question of "what" is going on in a course.

5. Network: Student cohorts might be created based on prior expertise, geospatial region or time zone, access patterns, project teams, or grades.

A. Student: Students taking MOOCs need to be extremely organized and disciplined. MOOCs have no weekly in-class teacher encounters, and they exert much less peer pressure. Courses might have vastly different schedules, activities such as labs and capstone projects, deadlines, and grading rubrics. Effectively using one or more MOOC platforms can itself be a major learning exercise. In addition, many students are not used to collaborating with students from different disciplinary backgrounds and cultures who speak different native languages and live in different time zones.

B. Teacher: Teachers of MOOCs need effective means to keep track of and guide the activities, progress, and any problems encountered by thousands of students. They need to understand the effectiveness of materials, exercises, and exams with respect to learning goals in order to continuously improve course schedules, activities, and grading rubrics.

C. Researcher: Researchers who study human learning are keen to understand what teaching and learning methods work well in a MOOC environment, and they now have massive amounts of detailed data with which to work. Because all student interactions with learning materials, teachers, and other students are recorded in a MOOC, human learning can be studied at an extreme level of detail. Many MOOC teachers double as learning researchers, as they are interested in making their own MOOC courses work for different types of students.

D. Platform Developer: Platform developers need to design systems that support effective course design, efficient teaching, and secure but scalable course delivery. They need to support times of high traffic and resource consumption and schedule maintenance during low-activity times.

Data

Demographic Data: General student demographics, including age, gender, language, education level, and location. Demographic data is commonly acquired during the registration process, and additional demographic data can be acquired via feedback surveys.

Performance Data: Student performance based on graded assessments. This is generally collected from homework, quizzes, and examinations, but it also includes results from pre-course surveys designed to examine student knowledge before they take the course.
Activity Data: How students are using class resources, such as the time and date of watching videos, reading material, turning in homework, taking quizzes, or using the discussion forum. Most platforms break down usage by content and media type (i.e., page views, assignment views, textbook views, video views). The path through content via inbound and outbound links is important for understanding learning trajectories.

Feedback Data: Student input and feedback. Feedback data allows course providers to learn more about student learning goals and motivation, intended use of the course, and content students hoped to learn. It also contains information about what students liked or disliked in terms of course content, structure, grading, and teacher interaction.

Data Type      Data Field                      edX   Canvas   Coursera   GCB
Demographics   Email                           D     DE       DE         DE
               Gender                          D     X        DE         X
               Age / Birth Year                D     X        DE         X
               Location                        D     X        D          GA
               Level of Education              D     X        D*         X
Performance    Class Grades                    D     D        D          D
               Quiz / Question Breakdown       D     D        D          D
               Student Breakdown               D     D        DE         DE
Activity       Test / Assignment Completion    D     D        D          D**
               Content Usage Breakdown         DE    D        D          GA
               Path Through Content            DE    DE       DE         GA
               Time-stamped Activity           DE    D        DE         GA
               "Active" Student Count          D     D        D          X
               Student Breakdown               DE    D        DE         X
Feedback       Supports Surveys

D = Dashboard; DE = Data Export; GA = Google Analytics.
*If students choose to take the optional demographic survey. **If toggled to record.

Acknowledgements

We would like to thank Samuel T. Mills for re-designing the figures in this paper and all 2013 and 2014 IVMOOC students for their feedback, comments, enthusiasm, and support. All R code and all Sci2 Tool workflows are available and documented at cns.iu.edu/2015-MOOCVis.



[Figure residue: scatter plots "Midterm Score vs Time Watched" and "Final Score vs Time Watched" (score vs. hours watched; 2013 and 2014 cohorts, badge vs. no badge, with trendlines); bar charts "Midterm Scores by Question" (questions 1-31) and "Final Scores by Question" (questions 1-69), showing the number of students with no, partial, or full credit per question; and a plot of optimal action probabilities for A1, A2, and A3 as a function of the number of A1 actions.]

Figure 1: User receives the badge after completing 25 A1 actions. Actions A1 increase towards the badge boundary at the expense of the other site action A2 (shifting effort within the site) as well as the life-action A3 (increase in site activity).

and then, solving for U(a_1) = U(x_a), we have

U(a_1) = \frac{\theta \cdot x_a^1 \cdot U(x_{a+e_1}) - g(x_a, p)}{1 - \theta (x_a^2 + x_a^3)}

Since we have already computed U(a_1 + 1) = U(x_{a+e_1}), this becomes an optimization problem in 3 variables:

\begin{aligned}
\underset{x_a}{\text{maximize}} \quad & \frac{\theta \cdot x_a^1 \cdot C - g(x_a, p)}{1 - \theta (x_a^2 + x_a^3)} \\
\text{subject to} \quad & x_a^j \ge 0, \; j = 1, 2, 3, \quad \text{and} \quad \sum_{j=1}^{3} x_a^j = 1
\end{aligned}

where we've replaced U(x_{a+e_1}) with C. In the appendix of the extended version of the paper we show how to solve this problem efficiently. For our purposes, the important point is that the optimal distribution in state a_1 can be computed using the solution of the state a_1 + 1. Since we know x_a = p for all states a such that a_1 ≥ k, we can use this to compute the optimal x_a for all a such that a_1 = k − 1, and recurse all the way back to a_0, thus solving the user's optimization problem in the one-dimensional case.
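To make the recursion concrete, here is a minimal sketch of the backward dynamic program for the one-badge, one-dimensional case. The discount theta, the preferred distribution p, the quadratic deviation penalty g, and the handling of the boundary value are all illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the one-dimensional backward recursion.
# theta, p, the quadratic penalty g, and the boundary value are assumptions.
import numpy as np
from scipy.optimize import minimize

theta = 0.9                          # discount factor
p = np.array([0.3, 0.3, 0.4])        # preferred mix over A1, A2, life-action A3
k, V_b = 25, 5.0                     # badge boundary and badge value

def g(x, p):
    return np.sum((x - p) ** 2)      # one simple choice of deviation penalty

U = np.zeros(k + 1)                  # U[a1]: utility in any state with a1 A1-actions
U[k] = V_b                           # crude stand-in for the value past the boundary
policy = {}

for a1 in range(k - 1, -1, -1):
    C = U[a1 + 1]                    # already computed, one step closer to the badge
    def neg_value(x, C=C):
        return -(theta * x[0] * C - g(x, p)) / (1 - theta * (x[1] + x[2]))
    res = minimize(neg_value, p, bounds=[(0, 1)] * 3,
                   constraints=({'type': 'eq', 'fun': lambda x: x.sum() - 1},))
    U[a1], policy[a1] = -res.fun, res.x   # optimal utility and mixing distribution
```

Plotting policy[a1][0] against a1 would reproduce the qualitative shape of Figure 1: the probability placed on A1 rises as a1 approaches the boundary.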

To illustrate the effects captured by our model, we compute the optimal policy on a simple illustrative instance. In this instance, we place one threshold badge on action A1 with boundary 25 (i.e., b = (25, 1)). We then solve the user optimization problem defined above and plot the optimal x_a as a function of a_1 (see Figure 1). The user's optimal mixing probability gets progressively more deflected away from his preferred p as he approaches the badge boundary. The user also increases his probability on A1 by offloading probability mass from both other action types: he is shifting his effort within the site (moving probability mass from the other site action A2) and also increasing participation on the site overall (moving probability mass from the life-action A3).

Note that unifying participation and shifting site effort in this way does not necessarily make them equivalent in our framework. The deviation penalty function g(x_a, p) could be chosen to penalize deviating from the user's preference for the life-action more than deviating from his preferences for the various site actions.

Multiple badges. So far we considered a case where there is only one badge on the site. We now show that a similar algorithm can solve the user's optimization problem when there are multiple badges that all target the same dimension.

Let B = {b_j = (k_j, 1)} for j ∈ {1, . . . , m} be a set of m badges, all on the same action A1, and assume that k_1 < k_2 < . . . < k_m without loss of generality. (If k_j = k_{j′} for some j ≠ j′, then we can consider these two badges as a single badge with value equal to the sum of their individual values.)

The fact that there are many badge boundaries does not affect the algorithm; the observation about the utility in all states with the same number of A1 actions still holds, since all such states are equidistant from all badge boundaries, and thus the optimization problem is the same in all of them. The problem thus becomes one-dimensional and solvable in the exact same way as before. The value of each state is "initialized" with \sum_{b \in B} I_b(a) V_b, and again our dynamic programming base case is that in all states a after all the badge boundaries, the user will choose x_a = p. Then the region between the last and second-last badge boundaries is identical to the one-badge case we solved in the previous section and can be solved analogously. In general, the region between badges j − 1 and j is identical to the single-badge case with a badge of value V_{b_j} + U(x_{k_j}). In this way, we recurse backwards through the set of badges to solve the one-dimensional case with many badges.

Two targeted dimensions. Now we consider the case where different badges target different types of actions. We start with B = {b_1 = (k_1, 1), b_2 = (k_2, 2)}, so there are two dimensions with one badge targeting each (again, let n = 2 for convenience).

We begin by observing that only actions on targeted dimensions affect the optimization problem in any state, thus the utility values in two states with the same number of A1 actions and A2 actions are the same. Our problem, and corresponding dynamic programming table, is thus two-dimensional. The badge boundaries a_1 = k_1 and a_2 = k_2 split the action space into four regions:

• R: a finite rectangle bounded by the origin and (k_1 − 1, k_2 − 1),
• H: an infinite horizontal strip with boundary points (k_1, 0) and (k_1, k_2 − 1) extending rightward,
• V: an infinite vertical strip with boundary points (0, k_2) and (k_1 − 1, k_2) extending upward, and
• Q: a quadrant rooted at (k_1, k_2).

Similarly to before, past all the badge boundaries the user has no incentive to deviate from p, so x_a = p for all states in quadrant Q (those with a_1 ≥ k_1 and a_2 ≥ k_2).

Quadrants H and V are then identical to the case of one threshold badge in one targeted dimension that we solved above.

Now we are left with the finite rectangle R, which we can directly fill in, in order of decreasing coordinate sum, since the cells furthest from the origin depend on the value of states we already know from solving quadrants Q, H, and V. For every state a ∈ R:

U(x_a) = \theta \sum_{j=1}^{n+1} x_a^j \cdot U(x_{a+e_j}) - g(x_a, p)

Consider a state a in region R that we process in order. We have already computed U(x_{a+e_1}) and U(x_{a+e_2}), so we can further simplify:

U(x_a) = \frac{\theta \cdot (C_1 \cdot x_a^1 + C_2 \cdot x_a^2) - g(x_a, p)}{1 - \theta \cdot x_a^3}

where C_j = U(x_{a+e_j}) for j = 1, 2 are the constants we have computed. This results in an optimization problem very similar to the one we had in the one-dimensional case:

\begin{aligned}
\underset{x_a}{\text{maximize}} \quad & \frac{\theta \cdot (C_1 \cdot x_a^1 + C_2 \cdot x_a^2) - g(x_a, p)}{1 - \theta \cdot x_a^3} \\
\text{subject to} \quad & x_a^j \ge 0, \; j = 1, 2, 3, \quad \text{and} \quad \sum_{j=1}^{3} x_a^j = 1
\end{aligned}

[Figure residue: scatter plot of per-student activity over time (x-axis: date, from a Jan 1 course start through Mar 1, May 1, and the final exam; y-axis: student; activity types: Registration, Exam, YouTube, Twitter; 0-1500 events per day).]


these extremes, only a noticeable shoulder (see Figure 2a). The intermediate durations are filled with attempters we divided into tranches (in colors) on the basis of how many assessment items they attempted on homework and exams: browsers (gray) attempted < 5% of homework; tranche 1 (red) 5%-15% of homework; tranche 2 (orange) 15%-25% of homework; tranche 3 (green) > 25% of homework; and tranche 4 (cyan) > 25% of homework and > 25% of the midterm exam. Certificate earners (purple) attempted most of the available homework, midterm, and final exams. The median total time spent in the course for each tranche was 0.4 hours, 6.4 hours, 13.1 hours, 30.0 hours, 53.0 hours, and 95.1 hours, respectively. In addition to these tranches, just over 150 certificate earners spent fewer than 10 hours in the course, possibly representing a highly skilled tranche seeking certification. Similarly, just over 250 test takers spent fewer than 10 hours in the course and completed more than 25% of both exams but did not earn a certificate.

The average time spent in hours per week for participants in each tranche is shown in Figure 2c. Tranches attempting fewer assessment items not only taper off earlier, as the majority of participants effectively drop out, but also invested less time in the first few weeks than the certificate earners. The correlation of attrition with less time spent in early weeks begs the question of whether motivating students to invest more time would increase retention rates.

In the rest of this article, we restrict ourselves to certificate earners, as they accounted for the majority of resource consumption; we also wanted to study time and resource use over the whole semester.

Frequency of accesses. Figure 3a shows the number of active users per day for certificate earners, with large peaks on Sunday deadlines for graded homework and labs but not for lecture questions. There is a downward trend in the weeks between the midterm and the final exam (shaded regions). No homework or labs were assigned in the last two weeks before the final exam, though the peaks persist. We plotted activity in events (clicks subject to time cutoffs) per active student per day for assessment-based course components and learning-based components in Figure 3b and Figure 3c. Homework sets and the discussion forums account for the highest rate of activity per student, with discussion activity increasing over the semester. Lecture question events decay early as homework activity increases. Textbook use peaks during exams, and there is a noticeable drop in textbook activity after the midterm, as is typical in traditional courses.18

Time on tasks. Time represents the principal cost function for students, so it is important to study how students allocate time among available course components.15,19 Figure 4 shows the most time is spent on lecture videos; since three to four hours per week is close to the total duration of the scheduled videos, students who rewound and reviewed the videos must compensate for those speeding up playback or omitting videos.

The most significant change over the first seven weeks was the apparent transfer of time from lecture questions to homework, as in Figure 4. Considering a performance-goal orientation (see Figure 5), it should be noted that homework counted toward the course grade, whereas lecture questions did not. But even on mastery-oriented grounds, students might have viewed completion of homework as sufficient evidence of understanding lecture content. The prominence of time spent in discussion

Figure 3. Frequency of accesses.

From left to right: the number of unique certificate earners N active per day, and their average number of accesses each day for assessment-based and learning-based course components. Plot (a) highlights the periodicity and trends of the certificate earners. Plot (b) is for assessment, including homework, lab, and lecture questions, showing the number of accesses per active user that day. Learning-based components in plot (c) include lecture videos, textbook, discussion, tutorial, and wiki, showing discussion forums were used more heavily and with strong periodicity later in the term, similar to graded activities in plot (a), while other components lack periodicity and vary greatly in terms of frequency of accesses.
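As a rough illustration of the metrics behind Figure 3, both quantities can be computed directly from a clickstream log. The file name and the user_id/timestamp/component columns below are assumptions about the log format, not the edX schema.

```python
# Hypothetical sketch of the Figure 3 metrics from a per-click event log.
import pandas as pd

events = pd.read_csv("tracking_log.csv", parse_dates=["timestamp"])
events["day"] = events["timestamp"].dt.date

# N(day): unique certificate earners active each day (panel a)
n_day = events.groupby("day")["user_id"].nunique()

# events per active user per day, by course component (panels b and c)
per_component = events.groupby(["day", "component"]).size().unstack(fill_value=0)
events_per_active_user = per_component.div(n_day, axis=0)
```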

[Figure 3 panels: (a) number of unique certificate earners active per day, Weeks 1-14, with midterm and final marked; (b) events per active student per day for Homework, Lab, and Lecture Question; (c) events per active student per day for Tutorial, Wiki, Discussion, Lecture Video, and Book.]

The shaded regions near Week 8 and Week 14 represent the time span for the midterm and final exams.

[Figure residue: "Observed events by date" time series for (a) 6.002x: MIT/MITx, 2012-03-03 through 2012-06-13 (0 to 6,000,000 events per day), and (b) Crypto 1: Stanford/Coursera, 2013-01-13 through 2013-03-02 (0 to 80,000 events per day).]

Figure 5: Example of interactive visual analytics showing the number of observed events by day.

currently available at http://moocdbfeaturediscovery.csail.mit.edu/. A powerful, unique feature of this platform is that it effectively democratizes MOOC data science by providing access to those who cannot either directly work with or gain access to specific data. It will allow many more people to participate in MOOC data science.

Summary

Our progress to date has been to build community support, develop software frameworks, and mature initial concepts for open source frameworks for MOOC data science. In the coming year we envision fully developing these frameworks, seeking more community involvement and collaborations while helping accomplish our ultimate goal of improving educational outcomes through data science.

Acknowledgements

The authors acknowledge discussions and support from a number of researchers and stakeholders in the MOOC space. The following researchers' contributions have been instrumental in developing the frameworks and building community support: Sherif Halawa, Andreas Paepcke (Stanford U.), Chuong Do (Coursera), Franck Dernoncourt, Colin Taylor, Sherwin Wu, Elaine Han (MIT), Piotr Mitros, Rob Rubin (EdX).

References

[1] K. Veeramachaneni, F. Dernoncourt, C. Taylor, Z. Pardos, and U.-M. O'Reilly. MOOCdb: Developing data standards for MOOC data science. 1st Workshop on Massive Open Online Courses at the 16th AIED, 2013.

[2] K. Veeramachaneni, U.-M. O'Reilly, F. Dernoncourt, C. Taylor, S. Halawa, and C. Do. Developing standards and systems for MOOC data science. MOOCdb Technical Report, CSAIL, MIT, 2013.


[Figure residue: map of students per country (legend: 1 to 780 students; years 2013 and 2014). Top 5 countries, 2013: U.S. (33.4%), India (6.8%), U.K. (4.6%), Canada (3.7%), Netherlands (3.5%); 2014: U.S. (39.5%), India (8.7%), U.K. (5.1%), Canada (3.3%), Spain (2.7%). 253 students did not identify a country in 2013; 110 students did not identify a country in 2014.]

[Figure residue: stacked bar charts of relative resource use (0%-100%) by country (US, IN, CN, RU, DE, PL, BR) for (a) 6.002x: MIT/MITx (categories: lecture, tutorial, informational, problem, exam, wiki, profile, book, other) and (b) Crypto 1: Stanford/Coursera (categories: lecture, exam, wiki, forum, index, home, other).]

Figure 4: Differences in relative use of resources by students from different countries. A student's country is derived from the IP address s/he commonly logs in from.

[Figure residue: stacked bar charts of relative resource use (0%-100%) by grade cohorts A, B, and C for (a) 6.002x: MIT/MITx and (b) Crypto 1: Stanford/Coursera, with the same resource categories as Figure 4.]

Figure 5: Differences in relative use of resources based on grade cohorts.


forums is especially noteworthy, as they were neither part of the course sequence nor did they count for credit. Students presumably spent time in discussion forums due to their utility, whether pedagogical or social or both. The small spike in textbook time at the midterm, a larger peak in the number of accesses, as in Figure 3, and the decrease in textbook use after the midterm are typical of textbook use when online resources are blended with traditional on-campus courses.18 Further studies comparing blended and online textbook use are also relevant.3,17

Percentage use of course components. Along with student time allocation, the fractional use of the various course components continues to be an important metric for instructors deciding how to improve their courses and researchers studying the influence of course structure on student activity and learning. For fractional use, we plotted the percentage of certificate earners having accessed at least a certain percentage of resources in a course component (see Figure 5). Homework and labs (each 15% of the overall grade) reflect high fractional use. The inflection in these curves near 80% might have been higher but for the course policy of dropping the two lowest-graded assignments. The low proportionate use of textbook and tutorials is similar to the distribution observed for supplementary (not explicitly included in the course sequence) e-texts in large introductory physics courses,16 though the 6.002x textbook was assigned in the course syllabus. The course authors were disappointed with the limited use of tutorial videos, suspecting that placing tutorials after the homework and laboratory (they were meant to help) in the course sequence was partly responsible. (The wiki and discussion forums had no defined number of resources, so are excluded here.)
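A minimal sketch of how such a fractional-use curve can be computed, assuming a mapping from each certificate earner to the set of resource ids they accessed (all names here are illustrative):

```python
# Hypothetical sketch of the %N-vs-%R curve in Figure 5a.
import numpy as np

def fractional_use_curve(access_sets, n_resources, grid=np.arange(0, 101)):
    """access_sets: iterable of per-user sets of accessed resource ids."""
    pct = np.array([100.0 * len(s) / n_resources for s in access_sets])
    # %N of certificate earners who accessed more than %R of the resources
    return np.array([100.0 * np.mean(pct > r) for r in grid])
```

The density referred to in the next paragraph is then simply the negative slope of this curve, e.g. -np.gradient(curve).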

To better understand the middle curves representing lecture videos and lecture problems, it helps to recall that the negative slope of the curve is the density of students accessing that fraction of that course component (see Figure 5b and Figure 5c). Interestingly,

Figure 4. Time on tasks.

Certificate earners' average time spent, in hours per week, on each course component; midterm and final exam weeks are shaded.

[Figure residue: total time (hrs) per certificate earner per week, Weeks 1-14 (0.0 to 3.5 hrs), for Homework, Lab, Lecture Question, Discussion, Book, Lecture Video, Tutorial, and Wiki, with midterm and final weeks marked.]

Figure 5. Fractional use of resources.

(a) Percentage of certificate earners who accessed greater than %R of that type of course resource. The density of users is the negative slope of the usage curve. Two points indicating bimodality of lecture video use are plotted: 76% of students accessed > 20% of lecture videos, and 33% of students accessed > 80% of lecture videos. (b) Bimodal distribution for videos accessed (as a percentage). (c) Distribution of lecture questions accessed.

[Figure residue: (a) %N of certificate earners accessing > %R of resources versus %R (0-100) for Homework, Lab, Book, Lecture Video, Lecture Question, and Tutorial; (b) distribution of lecture videos accessed, annotated with "76% of students accessed > 20% of videos" and "33% of students accessed > 80% of videos"; (c) distribution of lecture questions accessed.]

[Figure residue: histograms of the number of assignment questions submitted (0-150) and the number of quiz attempts (log scale) versus number of users.]

Figure 8: Number of handed-in assignments (left) and quizzes (right) for high-achievers.

[Figure residue: scatter of the number of distinct contributors (0-12) versus thread length (0-14) for courses ML1-ML3 and PGM1-PGM3.]

Figure 9: Number of distinct contributors as a function of thread length.

4. COURSE FORUM ACTIVITY

We now move on to our second main focus of the paper, the forums, which provide a mechanism for students to interact with each other. Because Coursera's forums are cleanly separated from the course materials, students can choose to consume the course content independently of the other students, or they can also communicate with their peers.

Following our classification of students into engagement styles, our first question is a simple one: which types of students visit the forums? To answer this, we compute the distribution of engagement styles for the population of students who read at least one thread on ML3 (shown in the top row of Table 3). The representation of engagement styles on the forum is significantly different from the class as a whole, with more active students over-represented on the forum; for example, Bystanders comprise over 50% of registered students but only 10% of the forum population.

We also compute the fraction of each engagement style present on the forum (the bottom row of Table 3). It is striking that 90% of All-rounders are forum readers, meaning that the two populations heavily overlap. While numerically the forum is used by a small fraction of the full population of registered students, this is a superficial measure; using our engagement taxonomy it is apparent that a large majority of the most engaged students are on the forum.
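The two rows of Table 3 (below) are straightforward conditional frequencies. A minimal sketch, assuming one row per registered student with a style label and an on_forum flag (both column names are assumptions):

```python
# Hypothetical sketch of the Table 3 quantities.
import pandas as pd

def engagement_vs_forum(students: pd.DataFrame):
    # P(S|F): distribution of engagement styles among forum participants
    p_s_given_f = students.loc[students["on_forum"], "style"].value_counts(normalize=True)
    # P(F|S): fraction of each engagement style present on the forum
    p_f_given_s = students.groupby("style")["on_forum"].mean()
    return p_s_given_f, p_f_given_s
```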

          Bystander   Viewer   Collector   All-rounder   Solver
P(S|F)    0.106       0.277    0.192       0.408         0.017
P(F|S)    0.050       0.334    0.369       0.894         0.648

Table 3: How engagement styles are distributed on the ML3 forum. P(S|F) is the probability of engagement style given forum presence (reading or writing to at least one thread); P(F|S) is the probability of forum presence given engagement style.

The composition of threads. The forum is organized in a sequence of threads: each thread starts with an initial post from a student, which is then potentially followed by a sequence of further posts. Threads cover a variety of topics: discussion of course content, a question followed by proposed answers, and organizational issues, including attempts by students to find study groups they can join.

Forum threads are a feature of a wide range of Web sites (social networking sites, news sites, question-answer sites, product-review sites, task-oriented sites), and they are used quite differently across domains. Thus one has to be careful in adapting existing intuitions about forums to the setting of online courses; in principle it would

be plausible to conjecture that the forum might be a place where students engage in back-and-forth discussions about course content, or a place where students ask questions that other students answer, or a place where students weigh in one after another on a class-related issue. Our goal here is to develop an analysis framework that can clarify how the forums are in fact being used.

In particular, we'd like to address the following questions:

• Does the forum have a more conversational structure, in which a single student may contribute many times to the same thread as the conversation evolves, or a more straight-line structure, in which most students contribute just once and don't return?
• Does the forum consist of high-activity students who initiate threads and low-activity students who follow up, or are the threads initiated by less central contributors and then picked up by more active students?
• How do stronger and weaker students interact on the forum?
• Can we identify features in the content of the posts that indicate which students are likely to continue in the course and which are likely to leave?

The course forums contain many threads of non-trivial length, and we ask whether these threads are long because a small set of people are each contributing many times to a long conversation, or whether they are long because a large number of students are each contributing roughly once.

As a first way to address this question, we study the mean number of distinct contributors in a thread of length k, as a function of k. If this number is close to k, it means that many students are contributing; if it is a constant or a slowly growing function of k, then a smaller set of students are contributing repeatedly to the thread.

We find that the number of distinct contributors grows linearly in k (see Figure 9): a thread with k posts has roughly 2k/3 distinct contributors. Moreover, this slope is markedly consistent across all six courses in our data. The linear growth in distinct contributors forms an interesting contrast with discussion-oriented sites; for example, on Twitter and Facebook, the number of distinct contributors in a thread of length k grows sublinearly in k [9, 2].

Now, it is possible for the actual number of distinct contributors to exhibit two modes; for example, long threads on Facebook have this multi-modal behavior, as long conversational threads among a few users co-exist with "guest-book" style threads that bring in many users [2]. In our domain, however, we find a single mode near the mean number of distinct users; long conversational threads with very few contributors are extremely rare in these course forums.
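A minimal sketch of the distinct-contributor measurement, assuming each thread is available as a time-ordered list of poster ids (the representation is an assumption):

```python
# Hypothetical sketch of Figure 9: mean distinct contributors per thread length.
from collections import defaultdict

def contributors_by_length(threads):
    by_k = defaultdict(list)
    for posts in threads:                       # posts: poster ids in time order
        by_k[len(posts)].append(len(set(posts)))
    return {k: sum(v) / len(v) for k, v in sorted(by_k.items())}
```

A least-squares fit of the resulting means against k with slope near 2/3 would match the reported finding.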

Properties of thread contributors. Even if we know the forum is dominated by threads with many distinct contributors, there are still several possible dynamics that could be at work; for example, a top-down mechanism in which a high-activity forum user starts the discussion, or an initiator-response mechanism in which a less active user begins the thread and more active users continue it.

One way to look at this question is to plot a student's average forum activity level as a function of her position in the thread; that

Our analysis is made possible by the granular level of detail available in the activity traces on Stack Overflow.4 Each individual action performed by a user is recorded and timestamped, which affords us the ability to directly observe the complete sequence of actions users take and measure their progress towards obtaining badges. We use Stack Overflow data from the site's inception on July 31, 2008 to December 31, 2010.

Activity around the badge boundary. We first examine how users' propensities to take different types of actions vary as they approach the badge boundary. We aim to analyze both how users shift their effort between actions on the site and change their overall level of site activity. For each user we bin the number of actions of each type by day. This way, changes in the relative number of various types of actions per day indicate a user shifting his efforts on the site, and the daily sum over all actions measures his overall participation level (where an increase or decrease in participation can be interpreted as steering away from or towards the life-action).

For each badge, we take the complete set of users who ever achieved that badge and axis-align their activity profiles by letting "day 0" denote the day they receive the badge. To eliminate possible population effects, we restrict the set of users to those who were active at least 60 days before and after they win the badge. Figure 3 shows user activity in the days surrounding the awarding of the Electorate and Civic Duty badges. Notice how activity on the targeted actions (Q-votes for Electorate, Q-votes and A-votes for Civic Duty) increases substantially before users achieve the badge, and then almost immediately returns to near-baseline levels. Also notice that most of the other site actions are not adversely affected; the rates of these actions remain relatively stable over time. Since the four actions shown in the figure are the main activities on the site, this means users increase their overall activity level on the site in the days leading up to achieving these badges. The one exception is the A-vote curve in the Electorate badge, which drops in the days leading up to the badge boundary. This is evidence that users are steering their behavior from A-votes to Q-votes during this time.

Turning towards the badge. Given that we do indeed see the sort of steering behavior predicted by our model, where user activity increases near a badge boundary (in these cases at the expense of the life-action), we now examine how users steer towards badge boundaries. One of the main qualitative predictions of our model is that users will "turn" towards badge boundaries, meaning they will deviate more from their preferred actions as they get closer to receiving a badge. We test this prediction by computing which actions a user has taken over the course of his lifetime, and examining how actions change as a function of position in the action space.

We proceed as follows. For every state in the action space, we compute the empirical distribution over site actions that users took in that state. For example, for all users who at one point during their lifetimes had contributed exactly 11 questions, 17 answers, 20 question-votes, and 11 answer-votes, we calculate the distribution over the next action they are going to take. The resulting distribution represents the aggregate direction users traveled at that point in the action space. The composition of these directions forms a vector field like the one we modeled in Figure 2. Since this vector field is 4-dimensional, we visualize its projection onto the question-vote and answer-vote dimensions in Figure 4 (top), where hotter colors represent higher likelihoods of the next action being a question-vote (and cooler colors represent higher likelihoods of the next action being an answer-vote).

4 Stack Overflow generously gave us a complete trace of actions, but qualitatively similar results are derivable from the data that is publicly available on the Stack Overflow website.
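A minimal sketch of the per-state direction estimate described above, assuming each user's history is available as a chronological list of action ids 0-3 (the encoding is an assumption):

```python
# Hypothetical sketch of the empirical "vector field" over the action space.
from collections import Counter, defaultdict

def direction_field(histories, n_types=4):
    counts = defaultdict(Counter)
    for seq in histories:                 # one chronological action sequence per user
        state = [0] * n_types
        for action in seq:
            counts[tuple(state)][action] += 1   # next action taken from this state
            state[action] += 1                  # advance to the successor state
    return {s: {a: c / sum(ctr.values()) for a, c in ctr.items()}
            for s, ctr in counts.items()}
```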

[Figure residue: two panels of number of actions per day (Qs, As, Q-votes, A-votes) from 60 days before to 60 days after the badge win, for the Electorate and Civic Duty badges.]

Figure 3: Number of actions per day as a function of the number of days relative to the time of obtaining a badge. Notice steering in the sense of increased activity on actions targeted by the badge.
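A minimal sketch of the "day 0" alignment behind Figure 3, assuming an event log with user_id, action_type, and date columns plus a per-user badge-award date (all names are assumptions):

```python
# Hypothetical sketch of axis-aligning activity profiles around the badge day.
import pandas as pd

def aligned_activity(log: pd.DataFrame, badge_day: pd.Series, window: int = 60):
    df = log.merge(badge_day.rename("day0"), left_on="user_id", right_index=True)
    df["rel_day"] = (df["date"] - df["day0"]).dt.days
    # keep users active at least `window` days before and after the badge win
    span = df.groupby("user_id")["rel_day"].agg(["min", "max"])
    keep = span[(span["min"] <= -window) & (span["max"] >= window)].index
    df = df[df["user_id"].isin(keep) & df["rel_day"].between(-window, window)]
    per_day = df.groupby(["rel_day", "action_type"]).size().unstack(fill_value=0)
    return per_day / df["user_id"].nunique()    # average actions per retained user
```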

The first salient feature of the vector field is the gradient from hot to cold as the angle departing the origin varies between the two extremes. The fact that the color stays the same along a given direction starting from the origin is a validation of our modeling assumption that users have preferred distributions over the action types: one interpretation consistent with this gradient is that users tend to travel in the direction they have already traveled in.

To more clearly illustrate the "turning" effect, we normalize out the tendency of users to act as they have acted in the past by subtracting off the direction of each cell (so that each cell shows the difference between the empirical fraction of actions users chose and the fraction given by the position of the cell). In the resulting plot, Figure 4 (bottom), white indicates a direction matching the vector corresponding to the position in action space, black indicates a higher probability of taking a question-vote, and red indicates a lower probability of taking a question-vote.

The dominant white color for small x values shows that away from badge boundaries, users do not deviate much from the direction they have already taken. For greater x values, we clearly see that once users get near the Electorate badge boundary, they start performing more question votes (than we would expect given their position in the action space). Furthermore, this shifting towards question votes intensifies as they approach the boundary (the darkest greys occur right before the boundary). As soon as users obtain the badge, they shift back away from question voting (doing this action less than we would expect given their position, as indicated by the red color). That the badge boundary naturally emerges from observing users' directions in action space is a striking confirmation that badges influence user behavior and supports our multi-dimensional modeling framework. Additionally, the increase in intensity as we move along the x axis agrees with our model's qualitative prediction that users are increasingly incentivized as they approach badge boundaries.
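A minimal sketch of that normalization, reusing the empirical next-action fractions per cell from the vector-field computation above (names are illustrative):

```python
# Hypothetical sketch of the Figure 4 (bottom) normalization: subtract, from
# the empirical next-action fractions in each cell, the fractions implied by
# the cell's own position in action space.
import numpy as np

def normalize_directions(empirical_frac, n_types=4):
    out = {}
    for cell, frac in empirical_frac.items():
        pos = np.array(cell, dtype=float)
        if pos.sum() == 0:
            continue                      # the origin has no defined direction
        implied = pos / pos.sum()
        vec = np.zeros(n_types)
        for a, f in frac.items():
            vec[a] = f
        out[cell] = vec - implied         # deviation from the user's own direction
    return out
```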

[Figure residue: network legend. Nodes: students by role (Data Manager, Data Miner, Designer, Visualization Expert, Librarian, Programmer, Project Manager, Usability Expert, No Information). Edges: direct communication via Twitter; group membership.]


the distribution for the lecture videos is distinctly bimodal: 76% of students accessed over 20% of the videos (or 24% of students accessed less than 20%), and 33% accessed over 80% of the videos. This bimodality merits further study into learning preferences; for example, do some students learn from other resources exclusively? Or did they master the content prior to the course? The distribution of lecture-problem use is flat between 0% and 80%, then rises sharply, indicating that many students accessed nearly all of them. Along with the fact that the time on lecture questions drops steadily in the first half of the term (see Figure 2), this distribution suggests students not only allocated less time to them, some abandoned the lecture problems entirely.

Resources used when problem solving. Patterns in the sequential use of resources by students may hold clues to cognitive and even affective state.2 We therefore explored the interplay between use of assessment and learning resources by transforming time-series data into transition matrices between resources. The transition matrix contains all individual resource-resource transitions, which we aggregated into transitions between major course components. The completeness of the 6.002x learning environment means students did not have to leave it to reference the textbook, review earlier homework, or search the discussion forums. We thus had a unique opportunity to observe transitions to all course components accessed by students while working problems. In previous studies of online problem solving this information was simply missing.21

Figure 6 highlights student transitions from problems (while solving them) to other course components, treating homework sets, the midterm, and the final exam as separate assessment types of interest. Figure 6 shows the discussion forum is the most frequent destination during homework problem solving, though lecture videos consume the most time. During exams (midterm and final are similar), previously done homework is the primary destination, while the book consumes the most time. Student behavior on exam problems thus contrasts sharply with behavior on homework problems. Note that because homework was aggregated, we could not isolate "references to previous assignments" for students doing homework.

Conclusion. This article's major contribution to course analysis is showing how MOOC data can be analyzed in qualitatively different ways to address important issues: attrition/retention, distribution of students' time among resources, fractional use of those resources, and use of resources during problem solving. Among the more significant findings is that participants who attempted over 5% of the homework represented only 25% of all participants but accounted for 92% of the total time spent in the course; indeed, 60% of the time was invested by the 6% who ultimately received certificates. Participants who left the course invested less effort than certificate earners, with those investing the least effort during the first two weeks tending to leave sooner. Most certificate earners invested the plurality of their time in lecture videos, though approximately 25% of the earners watched less than 20%. This suggests the need for a follow-up investigation into the correlations between resource use and learning. Finally, we highlight the significant popularity of the discussion forums in spite of being neither required nor included in the navigation sequence. If this social learning component played a significant role in the success of 6.002x, a totally asynchronous alternative might be less appealing, at least for a complex topic like circuits and electronics.

Some of these results echo effects seen in on-campus studies of how course structure affects resource use18 and performance outcomes4,11,19 in introductory (college) courses. This
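A minimal sketch of the transition-matrix construction described above, assuming a time-stamped event log with user_id and component columns (the schema is an assumption):

```python
# Hypothetical sketch: aggregate consecutive events in each student's stream
# into component-to-component transition counts.
import pandas as pd

def transition_matrix(events: pd.DataFrame) -> pd.DataFrame:
    events = events.sort_values(["user_id", "timestamp"])
    nxt = events.groupby("user_id")["component"].shift(-1)   # next component, per user
    pairs = pd.DataFrame({"src": events["component"], "dst": nxt}).dropna()
    return pairs.groupby(["src", "dst"]).size().unstack(fill_value=0)
```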

Figure 6. Transitions to other components during problem solving on (a) homework, (b) midterm, and (c) final. Arrows are thicker in proportion to the overall number of transitions, sorting components from top to bottom; node size represents total time spent on that component.

[Figure residue: three transition diagrams for (a) Homework, (b) Midterm Exam, and (c) Final Exam, each ranking destination components (Discussion, Lab, Lecture Video, Lecture Question, Book, Tutorial, Wiki, Homework) from most to least frequent.]

[Figure residue: "IVMOOC Video Views" bar chart of number of hits (0-800) per video, grouped by week (Weeks 1-7 plus new videos), with video types Theory 2013, Hands-On 2013, Theory 2014, and Hands-On 2014; the videos range from "Welcome" and "Course Overview" through weekly theory, hands-on, and weekly-tip videos.]



only direct interactions with the homework are logged with homework resources. There are clearly alternatives to this approach (such as considering all time between opening and answering a problem as problem-solving time21). Our time-accumulation algorithm is partially thwarted by users who open multiple browser windows or tabs; edX developers are considering ways to account for this in the future.
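A minimal sketch of time accumulation with an inactivity cutoff in the spirit described here; the 30-minute cutoff, the capping rule, and the column names are assumptions, not the authors' algorithm:

```python
# Hypothetical sketch: total time per user as the sum of inter-event gaps,
# with gaps longer than the cutoff capped (treating the excess as idle time).
import pandas as pd

def total_time_hours(events: pd.DataFrame, cutoff=pd.Timedelta("30min")):
    events = events.sort_values(["user_id", "timestamp"])
    gaps = events.groupby("user_id")["timestamp"].diff()    # NaT for first event
    capped = gaps.clip(upper=cutoff).fillna(pd.Timedelta(0))
    return capped.groupby(events["user_id"]).sum().dt.total_seconds() / 3600
```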

Results. The novelty and publicity surrounding MOOCs in early 2012 attracted a large number of registrants who were more curious than serious. We still take participation in assessment as an indication of serious intent. Of the 154,000 registrants in 6.002x in spring 2012, 46,000 never accessed the course, and the median time spent by all remaining participants was only one hour (see Figure 2a). We had expected a bimodal distribution of total time spent, with a large peak of "browsers" who spent only on the order of one hour and another peak from the certificate earners at somewhere more than 50 hours. There was, in fact, no minimum between

Figure 1. Screenshot of typical student view in 6.002x.

All course components are accessed from the interface shown below. The left sidebar defines the course sequence; weekly units include lecture sequences (videos and questions), homework, lab, and tutorials. The header navigation provides access to supplementary materials, including the digital textbook, discussion forums, and wiki. The main frame represents the first lecture sequence; beige boxes below the header indicate lecture videos and questions.

Figure 2. Tranches, total time, and attrition.

[Figure 2 residue: (a) histogram of participants by log total time, 1 minute to 500 hours; (b) shares of total measured time by tranche (60%, 12%, 10%, 8%, 5%, 5%); (c) average hours per week per group over Weeks 1-14, with midterm and final weeks shaded. Groups: Browsers, Attempted > 5% HW, Attempted > 15% HW, Attempted > 25% HW, Attempted > 25% HW and > 25% Midterm, Certificate earners.]

(a) Distribution of time spent by participants in 6.002x (time axis is log-transformed); we divided the non-certificate earners into tranches based on the percentage of assessment activity they attempted (see also Table 1); (b) percentage of total measured time spent by each tranche; and (c) average time a student invested per week. The shaded regions near Week 8 and Week 14 represent the time span for the midterm and final exams.




[Figure 4: world choropleth maps; legend "Percentage of students who got a certificate". Panels: (a) 6.002x: MIT/MITx; (b) Crypto 1: Stanford/Coursera.]

Figure 4: The left plot shows, via coloring, the ratio of certificate winners to the number of registrants on a per-country basis for 6.002x, offered via the edX platform; Hungary (16.21%), Spain (14.55%), and Latvia (14.40%) are the highest. The right plot shows the same ratio for the Stanford cryptography course offered via Coursera; Russia (17.24%), the Netherlands (16.43%), and Germany (12.95%) are the highest.
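A minimal sketch of the per-country computation behind such maps follows; the column names are hypothetical, and the minimum of 100 registrants per country is inferred from the "cutoff100" filename above rather than stated in the text.

import pandas as pd

def cert_rate_by_country(df, min_registrants=100):
    # df: one row per registrant, with a 'country' string column and a
    # boolean 'certified' column (both names are assumptions)
    per_country = df.groupby("country")["certified"].agg(
        registrants="size", earners="sum")
    # drop countries below the registrant cutoff to avoid noisy ratios
    per_country = per_country[per_country["registrants"] >= min_registrants]
    per_country["cert_rate_pct"] = (
        100.0 * per_country["earners"] / per_country["registrants"])
    return per_country.sort_values("cert_rate_pct", ascending=False)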

Table 1: Courses overview

                Coursera course        edX course
Title           Cryptography I         Circuits and Electronics (6.002x)
Instructors     Dan Boneh              Anant Agarwal, Gerald Sussman, Piotr Mitros
University      Stanford University    MIT
Length          6 weeks                14 weeks
Platform        Coursera               edX
Start date      Jan 13th, 2013         March 5, 2012
Registrants     21,744                 154,763

allows researchers to refine the scripts. This framework is called MOOCViz. For its current appearance, see Figure 6. MOOCViz is currently available at http://moocviz.csail.mit.edu/.

MOOCViz, when completed, will allow analytic and visualization scripts for MOOC data analyses to be shared, under the understanding that the data being analyzed is organized according to the MOOCDB data model. MOOCViz activity entails developing a set of templates and support for software sharing, plus developing a means of sharing MOOC analytics results through web-based galleries.

Initial design of feature foundry: Another web-based platform currently under development will allow the community to propose variables of interest to be extracted from the data. It is targeted to enable the "crowd" (for example, students taking database courses) to extract and operationalize these variables. This platform will make mock data available. This data is generated by statistically modeling the real data and then sampling from the resulting model. The mock data allows users to debug their scripts when writing software to formulate their variables of interest. A very preliminary version of the platform is
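The fit-and-sample idea behind the mock data can be sketched as follows; the per-column model choices (a log-normal fit for a time-on-task column, empirical frequencies for a country column) are illustrative assumptions, not the platform's actual models.

import numpy as np

rng = np.random.default_rng(42)

def mock_column_continuous(real_values, n):
    # fit a normal in log space (a log-normal model) and sample from it
    logs = np.log(np.asarray(real_values, dtype=float))
    return np.exp(rng.normal(logs.mean(), logs.std(), size=n))

def mock_column_categorical(real_values, n):
    # sample categories with their empirical frequencies
    values, counts = np.unique(real_values, return_counts=True)
    return rng.choice(values, size=n, p=counts / counts.sum())

# e.g., synthesize 4 mock students from 3 real ones
minutes = mock_column_continuous([5.0, 60.0, 200.0], n=4)
countries = mock_column_categorical(["US", "US", "IN"], n=4)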




MOOC Visual Analytics: Empowering Students, Teachers, Researchers, and Platform Developers of Massively Open Online Courses

Analysis Types vs. User Needs

Abstract

Massively open online courses (MOOCs) offer instructors the opportunity to reach students in numbers orders of magnitude greater than they could in traditional classroom settings, while offering students access to free or inexpensive courses taught by world-class educators. However, MOOCs pose major challenges to teachers (keeping track of thousands of students and supporting their learning progress), students (keeping track of course materials and effectively interacting with teachers and fellow students), researchers (understanding how students interact with materials and each other), and MOOC platform developers (supporting effective course design and delivery in a scalable way).

Along with these challenges, the sheer volume of data available from MOOCs provides unprecedented opportunities to study how learning takes place in online courses. This paper explores the use of data analysis and visualization as a means to empower teachers, students, researchers, and platform developers by making large volumes of data easy to understand. First, we introduce the insight needs of these four user groups. Second, we review existing MOOC visual analytics studies. Third, we present a framework for MOOC data types and data analyses to support different types of insight needs. Fourth, we present exemplary data visualizations that make data accessible and empower teachers, students, developers, and researchers with novel insights. The outlook discusses future MOOC opportunities and challenges.

1. Statistics
Line graphs, correlation graphs, and box-and-whisker plots are all examples of how statistical data can be rendered visually.

2. Temporal
Temporal analyses and visualizations tell when students are active over the span of a course. Data might be examined at different levels of aggregation: by minute, hour, day, week, or semester, by course modules, or before and after a midterm or final (see the sketch after this list).

3. Geospatial
Geospatial data might be examined at different levels of aggregation: by address, city, country, or IP address.

4. Topical
Topical analysis provides an answer to the question of "what" is going on in a course.

5. Network
Student cohorts might be created based on prior expertise, geospatial region or time zone, access patterns, project teams, or grades.
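As an example of the temporal roll-ups mentioned in item 2, the following minimal sketch aggregates a clickstream at daily and weekly granularity; the event-log layout is an assumption.

import pandas as pd

# a toy event log; real logs would hold one row per click or video event
events = pd.DataFrame({
    "timestamp": pd.to_datetime(["2014-01-06 09:00", "2014-01-06 09:40",
                                 "2014-01-13 11:00", "2014-01-20 15:30"]),
    "user": ["u1", "u2", "u1", "u3"],
}).set_index("timestamp")

per_day = events.resample("D").size()    # event counts per day
per_week = events.resample("W").size()   # event counts per week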

A. Student
Students taking MOOCs need to be extremely organized and disciplined. MOOCs have no weekly in-class teacher encounters, and they exert much less peer pressure. Courses might have vastly different schedules, activities such as labs and capstone projects, deadlines, and grading rubrics. Effectively using one or more MOOC platforms can itself be a major learning exercise. In addition, many students are not used to collaborating with students from different disciplinary backgrounds and cultures who speak different native languages and live in different time zones.

B. Teacher
Teachers of MOOCs need effective means to keep track of and guide the activities, progress, and any problems encountered by thousands of students. They need to understand the effectiveness of materials, exercises, and exams with respect to learning goals in order to continuously improve course schedules, activities, and grading rubrics.

C. Researcher
Researchers who study human learning are keen to understand what teaching and learning methods work well in a MOOC environment and now have massive amounts of detailed data with which to work. Because all student interactions (with learning materials, teachers, and other students) are recorded in a MOOC, human learning can be studied at an extreme level of detail. Many MOOC teachers double as learning researchers, as they are interested in making their own MOOC work for different types of students.

D. Platform Developer
Platform developers need to design systems that support effective course design, efficient teaching, and secure yet scalable course delivery. They need to support times of high traffic and resource consumption and schedule maintenance during periods of low activity.

Scott R. Emmons, Robert P. Light, Katy Börner

Data

Demographic Data
General student demographics, including age, gender, language, education level, and location. Demographic data is commonly acquired during the registration process, and additional demographic data can be acquired via feedback surveys.

Performance Data
Student performance based on graded assessments. This is generally collected from homework, quizzes, and examinations, but it also includes results from pre-course surveys designed to examine student knowledge before they take the course.

Activity Data
How students are using class resources, such as the time and date of watching videos, reading material, turning in homework, taking quizzes, or using the discussion forum. Most platforms break down usage by content and media type (i.e., page views, assignment views, textbook views, video views). A student's path through content, via inbound and outbound links, is important for understanding learning trajectories (see the sketch after this section).

Feedback Data
Student input and feedback. Feedback data allows course providers to learn more about students' learning goals and motivation, their intended use of the course, and the content they hope to learn. It also contains information about what students liked or disliked in terms of course content, structure, grading, and teacher interaction.
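The following minimal sketch shows how path-through-content transitions could be extracted from an ordered clickstream; the event schema is an assumption, and the resulting counts are the weighted edges of a content-transition network.

from collections import Counter, defaultdict

def transition_counts(events):
    # events: iterable of (timestamp, user_id, page_id) clickstream records
    by_user = defaultdict(list)
    for ts, user, page in sorted(events):
        by_user[user].append(page)
    counts = Counter()
    for pages in by_user.values():
        counts.update(zip(pages, pages[1:]))  # consecutive (from, to) page pairs
    return counts  # edge weights for a content-transition network

log = [(1, "u1", "lecture1"), (2, "u1", "hw1"),
       (3, "u2", "lecture1"), (4, "u2", "hw1")]
print(transition_counts(log))  # Counter({('lecture1', 'hw1'): 2})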

Data Type      Data Field                      edX   Canvas   Coursera   GCB
Demographics   Email                           D     DE       DE         DE
               Gender                          D     X        DE         X
               Age / Birth Year                D     X        DE         X
               Location                        D     X        D          GA
               Level Education                 D     X        D*         X
Performance    Class Grades                    D     D        D          D
               Quiz / Question Breakdown       D     D        D          D
               Student Breakdown               D     D        DE         DE
Activity       Test / Assignment Completion    D     D        D          D**
               Content Usage Breakdown         DE    D        D          GA
               Path Through Content            DE    DE       DE         GA
               Time-stamped Activity           DE    D        DE         GA
               "Active" Student Count          D     D        D          X
               Student Breakdown               DE    D        DE         X
Feedback       Supports Surveys                ✓     ✓        ✓          ✓

D = Dashboard   DE = Data Export   GA = Google Analytics
* If students choose to take the optional demographic survey   ** If toggled to record

Acknowledgements
We would like to thank Samuel T. Mills for re-designing the figures in this paper and all 2013 and 2014 IVMOOC students for their feedback and comments, enthusiasm, and support. All R code and all Sci2 Tool workflows are available and documented at cns.iu.edu/2015-MOOCVis.