courses - illinoissifaka.cs.uiuc.edu/~wang296/course/ir_fall/past-projects/1.pdf · t hrough...




COURSES: A MOOC Search Engine

Ruilin Xu

[email protected]

Abstract

As there are hundreds of thousands of massive open online courses (MOOCs) on the internet, with more appearing every day, it is cumbersome for users to search efficiently for the specific courses they want. Users usually need to repeat the same search on multiple MOOC websites, such as Coursera[4] and Udacity[5], to find what they are looking for. COURSES was developed to address this. COURSES is a MOOC search engine that gathers around 4,000 courses from major MOOC websites: Coursera[4], Udacity[5], Khan Academy[6], Udemy[7], and Edx[8]. COURSES also implements faceted search, in which search results are grouped by facets such as course price, length, workload, instructors, and course category. By combining a search query with multiple filters, users can target specific courses efficiently.

Introduction

People increasingly enroll in massive open online courses (MOOCs) because of their convenience. As MOOCs have grown in popularity, the number of courses has increased dramatically and numerous MOOC websites have emerged. Unfortunately, as the number of MOOC websites grows, it becomes more difficult for users to find the courses they want. They often need to repeat their searches on different MOOC websites, trying to find the best result. This is both cumbersome and inefficient. Is there a way to search only once and easily find the courses we want from all the MOOC websites?

With that question in mind, and inspired by Assignment 3, I found that this idea was feasible. After discussion with the professor, I decided to create COURSES, a vertical MOOC search engine powered by Apache Solr, to solve the problem and to apply what I learned in class to a real-world task. This search engine simplifies the search for suitable online courses because it combines courses from five sites (Coursera[4], Udacity[5], Khan Academy[6], Udemy[7], and Edx[8]), eliminating the need for users to hop from site to site. Users can also filter results based on what they are looking for.

There are similar websites, such as RedHoop[1], MOOC-List[2], and Class-Central[3], but they all have downsides. They either do not incorporate as much data as COURSES or do not display as much information about the courses as the user needs. They display only the title of the course and a brief course introduction. The user cannot see other attributes of the course, such as its instructors, price, or length.

As a result, COURSES is both meaningful and useful. COURSES draws on a large data set of around 4,000 courses. It also has many useful faceted filters that let users narrow search results efficiently. Finally, COURSES displays all of the useful attributes of each course within the search results so that users can see everything in one place, which can save them a significant amount of time when deciding which courses to take.

Related Work

COURSES is inspired by Assignment 3 from our CS 410 course. Assignment 3 is a tutorial on building a simple search engine from a few data sources. COURSES is similar to Assignment 3 in that it uses a similar technique for parsing data: both use a combination of Ruby and JavaScript files. The difference is that COURSES uses JavaScript parsers that are much more complicated than the one used in Assignment 3. COURSES’ parsers are website-specific, meaning that each one parses the data of a particular MOOC website. After parsing, the data is converted to XML files, which are compatible with Apache Solr.

There is another well-implemented MOOC search engine called RedHoop[1]. RedHoop[1] and COURSES are similar in that both are multi-site search engines, meaning that they get course data from various MOOC websites, and both support faceted search. However, COURSES improves on the display of search results. RedHoop[1] displays only the course title with a brief introduction of the course, whereas COURSES displays much more useful information, such as price, course length, estimated workload, course language, and instructor information. COURSES also has more facet filters, such as course language and instructors, which are important for international users who want to take courses in their own language and for users who have a strong preference for certain instructors.


Problem Definition

The challenge I addressed was to create an online search engine for courses that draws from multiple sources and helps the user search more efficiently using different facets. The input is the user’s search query and any filters they want to apply. The expected output is the list of search results, sorted by relevance.

Building this search engine has four sub-challenges/stages, which I will now enumerate.

1. Data Crawling & Parsing

Because I needed to aggregate data from various sources, the first problem was how best to crawl and parse the data. Each website has its own data format, which may be very different from the others. For example, courses from Coursera[4] don’t have a price field, since all of them are free, whereas courses from Udacity[5] do have prices listed.

2. Data Processing & Consolidation

With all the data correctly parsed, another problem immediately emerged. Since I was building a single-source search engine that handles data from several different websites, I needed to design a data structure that could easily take in and consolidate all the data. I needed to decide which key attributes a course should have so that I could apply this structure to all the websites COURSES gets data from.

3. Data Formatting & Outputting

After consolidating the data, the next step is to load it into Apache Solr. To do this, I needed a way to convert the raw data into a standard format, such as XML or JSON files, that Apache Solr can read.

4. User Interface Design & Implementation

The last step, after inserting all the data into Apache Solr, was to design and implement a great user

interface which is sufficiently informative and correctly displays all the important data users need in an

aesthetically pleasing way.


Methods

This section describes my solution to each of the problems listed in the previous section.

COURSES is based on Apache Solr, which is a great framework for building vertical search engines, although I personally found Solr not that easy to use, since it doesn’t have much detailed introductory documentation. To get the data I wanted, I needed to write my own parsers. I based them on the crawler and parser from our Assignment 3 and modified them to generate XML data files that Solr can read. For the front-end user interface, I needed to make sure that the data is correctly displayed and that the interface is pleasant to the eye.

1. Data Crawling & Parsing

As mentioned above, since I needed to get data from several MOOC websites, I designed a website-specific parser for each website so that I could extract all the useful data correctly.

By examining each website closely with an inspection tool, I came up with the following tables of commands, which correctly parse the data from each website:

Coursera:

Title             document.title
Website           document.title.substring(document.title.indexOf("|")+2)
Length            document.body.getElementsByClassName("icon-calendar")[0].parentNode.childNodes[1].innerText
Workload          document.body.getElementsByClassName("icon-time")[0].parentNode.childNodes[1].innerText + document.body.getElementsByClassName("icon-time")[0].parentNode.childNodes[2].innerText
Language          document.body.getElementsByClassName("icon-globe")[0].parentNode.childNodes[1].innerText
Instructor        document.body.getElementsByClassName("coursera-course2-instructors-profile")[i].childNodes[2].childNodes[0].getElementsByTagName("span")[0].innerText (iterate i)
Instructor intro  document.body.getElementsByClassName("coursera-course2-instructors-profile")[i].childNodes[2].childNodes[1].getElementsByTagName("span")[0].innerText (iterate i)
Course categories document.body.getElementsByClassName("coursera-course-categories")[0].getElementsByTagName("a")[i].innerText (iterate i)
Course intro      document.body.getElementsByClassName("span6")[0].innerText
Course body       document.body.getElementsByClassName("span7")[0].innerText

Edx:

Title             document.title
Website           document.title.substring(document.title.indexOf("|")+2)
Length            document.body.getElementsByClassName("course-detail-length")[0].innerText.substring("Course Length: ".length)
Workload          document.body.getElementsByClassName("course-detail-effort")[0].innerText.substring("Estimated effort: ".length)
Instructor        document.body.getElementsByClassName("staff-list")[0].getElementsByTagName("li")[i].childNodes[3].childNodes[1].innerText (iterate i)
Instructor intro  document.body.getElementsByClassName("staff-list")[0].getElementsByTagName("li")[i].childNodes[3].childNodes[3].innerText (iterate i)
Course intro      document.body.getElementsByClassName("course-detail-subtitle copy-lead")[0].innerText
Course body       document.body.getElementsByClassName("course-section course-detail-about")[0].innerText + document.body.getElementsByClassName("view-display-id-errata")[0].innerText (the second part might not exist)

Khan:

Title             document.title
Website           document.title.substring(document.title.indexOf("|")+2)
Course intro      document.body.getElementsByClassName("topic-desc")[0].innerText
Course body       document.getElementById("page-container-inner").innerText

Udacity:

Title             document.title
Website           document.title.substring(document.title.indexOf("|")+2)
Price             document.body.getElementsByClassName("price-information")[0].innerText (if the text contains “null”, the course is free)
Length            document.body.getElementsByClassName("duration-information")[0].getElementsByClassName("col-md-10")[0].getElementsByTagName("strong")[0].innerText.substring("Approx. ".length)
Workload          document.body.getElementsByClassName("duration-information")[0].getElementsByClassName("col-md-10")[0].getElementsByTagName("small")[0].getElementsByTagName("p")[0].innerText.substring("Assumes ".length)
Instructor        document.body.getElementsByClassName("row row-gap-medium instructor-information-entry")[i].childNodes[2*j-1].childNodes[1].getElementsByTagName("h3")[0].innerText (iterate i; j = 1, 2)
Instructor intro  document.body.getElementsByClassName("row row-gap-medium instructor-information-entry")[i].childNodes[2*j-1].childNodes[3].getElementsByTagName("p")[0].innerText (iterate i; j = 1, 2)
Course intro      document.body.getElementsByClassName("col-md-8 col-md-offset-2")[1].getElementsByClassName("pretty-format")[0].innerText
Course body       document.body.getElementsByClassName("col-md-8 col-md-offset-2")[i].innerText (iterate i)

Udemy:

Title             document.title
Website           document.title.substring(document.title.indexOf("|")+2)
Price             document.body.getElementsByClassName("pb-p")[0].getElementsByClassName("pb-pr")[0].innerText
Length            document.body.getElementsByClassName("wi")[0].getElementsByClassName("wi-li")[1].innerText.replace(" of high quality content", "")
Instructor        document.body.getElementsByClassName("tb-li")[i].childNodes[1].getElementsByClassName("tb-r")[0].getElementsByTagName("a")[0].innerText (iterate i)
Instructor intro  document.body.getElementsByClassName("tb-li")[i].childNodes[3].getElementsByTagName("p")[0].innerText (iterate i)
Course intro      document.body.getElementsByClassName("ci-d")[0].innerText
Course body       document.body.getElementsByClassName("mc")[0].innerText
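The per-website command tables can be organized as one set of field getters per website. The following is a minimal sketch of that idea, not the project's actual parser code: the names PARSERS, websiteFromTitle, and parseCourse are hypothetical, only two websites and a few fields are shown, and only the extraction expressions themselves are taken from the tables.

```javascript
// Hypothetical registry: one entry per website, one getter per field.
// The getter bodies reuse expressions from the command tables; the rest is illustrative.
const websiteFromTitle = (doc) =>
  doc.title.substring(doc.title.indexOf("|") + 2);

const PARSERS = {
  khan: {
    title: (doc) => doc.title,
    website: websiteFromTitle,
    courseIntro: (doc) =>
      doc.body.getElementsByClassName("topic-desc")[0].innerText,
  },
  edx: {
    title: (doc) => doc.title,
    website: websiteFromTitle,
    length: (doc) =>
      doc.body
        .getElementsByClassName("course-detail-length")[0]
        .innerText.substring("Course Length: ".length),
  },
};

// Run every getter registered for a site.
// A getter that throws (for example, because the element is missing) yields null.
function parseCourse(site, doc) {
  const result = {};
  for (const [field, getter] of Object.entries(PARSERS[site])) {
    try {
      result[field] = getter(doc);
    } catch (err) {
      result[field] = null;
    }
  }
  return result;
}
```

On a loaded course page, a call such as parseCourse("khan", document) would return an object with one entry per field. Keeping the getters in a table like this means that supporting another website only requires adding another entry.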

2. Data Processing & Consolidation

With the above data parsed, I next designed a structure which can hold all attributes of a course and fit

data from all websites. The following table is the result:

Attribute           Coursera            Edx                 Khan                Udacity             Udemy
URL                 Given               Given               Given               Given               Given
Title               Parsed              Parsed              Parsed              Parsed              Parsed
Website             Parsed              Parsed              Parsed              Parsed              Parsed
Price               DEFAULT: FREE       DEFAULT: FREE       DEFAULT: FREE       Parsed              Parsed
Length              Parsed              Parsed              DEFAULT: Undefined  Parsed              Parsed
Workload            Parsed              Parsed              DEFAULT: Undefined  Parsed              DEFAULT: Undefined
Language            Parsed              DEFAULT: Undefined  DEFAULT: Undefined  DEFAULT: Undefined  DEFAULT: Undefined
Instructor          Parsed              Parsed              DEFAULT: Undefined  Parsed              Parsed
Instructor intro    Parsed              Parsed              DEFAULT: Undefined  Parsed              Parsed
Course categories   Parsed              DEFAULT: Undefined  DEFAULT: Undefined  DEFAULT: Undefined  DEFAULT: Undefined
Course intro        Parsed              Parsed              Parsed              Parsed              Parsed
Course body         Parsed              Parsed              Parsed              Parsed              Parsed
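One way to express this consolidated structure in code is a single record type with a default for every optional attribute. The sketch below is illustrative only: the names COURSE_DEFAULTS and consolidateCourse and the camelCase field names are assumptions, while the default values (“Free” for price, “Undefined” otherwise) follow the table.

```javascript
// Defaults from the consolidation table: price falls back to "Free",
// every other optional attribute falls back to "Undefined".
// The field names are hypothetical.
const COURSE_DEFAULTS = {
  price: "Free",
  length: "Undefined",
  workload: "Undefined",
  language: "Undefined",
  instructors: "Undefined",
  instructorIntros: "Undefined",
  categories: "Undefined",
};

// Merge whatever a website-specific parser produced into one uniform record.
function consolidateCourse(url, parsed) {
  const course = { url, ...COURSE_DEFAULTS };
  for (const [field, value] of Object.entries(parsed)) {
    // Keep the default when the parser found nothing for this field.
    if (value !== null && value !== undefined && value !== "") {
      course[field] = value;
    }
  }
  return course;
}
```

With this shape, a Khan Academy record and a Udacity record expose the same set of keys, so the later output stage does not need to know which website a course came from.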


3. Data Formatting & Outputting

Apache Solr has its own rules for data files; I chose to use its XML format. With the consolidated data above, I was able to create XML data files (one file for each website). After trying several options, I decided to write the output while reading in and processing the data. The following code snippet is excerpted from one of the parsers:

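// Build the course_length field; fall back to the default if the element is missing.
var length;
try
{
    length = "<field name=\"course_length\">" +
        document.body.getElementsByClassName("icon-calendar")[0]
            .parentNode.childNodes[1].innerText.trim()
            // Escape characters that cannot appear unescaped inside an XML field.
            .replace(/&/g, '&amp;')
            .replace(/</g, '&lt;')
            .replace(/>/g, '&gt;')
            .replace(/"/g, '&quot;')
            .replace(/'/g, '&#39;') +
        "</field>\n\t";
}
catch (err)
{
    // The lookup above throws when the page has no such element.
    length = "<field name=\"course_length\">Undefined</field>\n\t";
}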

The above code snippet shows how a parser gets the parsed data, processes it, and outputs it in the correct XML format. The data is obtained using the commands shown in the tables in the “Data Crawling & Parsing” section. COURSES trims surrounding whitespace, replaces special characters such as “<” with their entity references (“&lt;”), since those characters cannot appear unescaped inside an XML field, and puts the processed string into the correct XML field. If the data does not exist (i.e., an error occurred while executing the commands), the field is set to its default value.
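The same trim, escape, and default steps can be factored into small reusable helpers. The sketch below is one possible arrangement rather than the project's code: escapeXml, xmlField, and solrAddDocument are hypothetical names, and the wrapping add and doc elements follow Solr's XML update format.

```javascript
// Escape the five characters replaced in the parser excerpt.
// "&" is handled first so that entities produced by later steps are not escaped twice.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Build one <field> element; use defaultValue if the getter throws
// or does not return a string.
function xmlField(name, getter, defaultValue) {
  let value;
  try {
    value = escapeXml(getter().trim());
  } catch (err) {
    value = defaultValue;
  }
  return '<field name="' + name + '">' + value + "</field>";
}

// Wrap a list of field strings in Solr's XML update envelope.
function solrAddDocument(fields) {
  return "<add>\n<doc>\n\t" + fields.join("\n\t") + "\n</doc>\n</add>";
}
```

For example, xmlField("course_length", getter, "Undefined") would be called with a getter that evaluates one of the table commands, so a missing element on the page falls through to the default value instead of aborting the whole parse.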


4. User Interface Design & Implementation

Apache Solr has a default user interface, which uses the Velocity Search UI. I went into the “./config/velocity” folder and modified the “.vm” files, which was a very complicated job. The page is laid out in “layout.vm”, which in turn calls other files such as “header.vm”, “tabs.vm”, “content.vm”, and “footer.vm”, as well as the CSS file “main.css”. I commented out the unused cases and focused on “content.vm”. After trying many combinations, I came up with the main user interface, which appears as follows:

From the above screenshot, we can see the COURSES logo along with a simple search field and a submit button. On the left side of the web page, there are many facet filters. What’s great about these facet filters is that they group the search results into different categories and display them concisely to users. Users can then easily filter out unwanted results and get what they want more quickly. The search results are displayed on the right side. Each result shows the attributes of the course found, such as price, course length, estimated workload, course categories, instructors, and even a brief introduction of the course.


Evaluation/Sample Results

1. Experiment & Sample Results

After completing the above stages, I was able to test the system with some sample data. Since the entire data set contains around 4,000 URLs, it is very time-consuming to test on all of it. Therefore, I extracted 10 URLs from each website and ran the search engine on that data. I then entered queries such as “teaching”, “machine learning”, and “java” to see what was displayed.

To evaluate the search engine, we need to answer the following two questions:

1. Are the search results correctly displayed? That is, if there exists a search result, we need to be

able to find the following information:

a. Total number of results

b. Search time

c. Search results with the following information:

i. Course title, which is a link to the course webpage

ii. Course webpage’s logo

iii. Course URL

iv. Price

v. Length

vi. Workload

vii. Language

viii. Category

ix. Instructor

x. Instructor intro

xi. Course intro

xii. Course body

2. Does the faceted filter on the left side work? When we click on any filter, does the filter show up

and does it narrow down the search results?
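The first evaluation question can also be checked mechanically. The following sketch lists which of the required pieces of information are absent from a single search result; the REQUIRED_FIELDS names and the missingFields function are hypothetical and would need to match the field names actually used in the index. Default values such as “Free” and “Undefined” count as present.

```javascript
// Hypothetical field names, one per item in the evaluation checklist.
const REQUIRED_FIELDS = [
  "title", "logo", "url", "price", "length", "workload", "language",
  "category", "instructor", "instructorIntro", "courseIntro", "courseBody",
];

// Return the checklist items that a search result does not provide.
function missingFields(result) {
  return REQUIRED_FIELDS.filter(
    (field) =>
      result[field] === undefined || result[field] === null || result[field] === ""
  );
}
```

An empty array means that the result passes question 1; any other return value names the attributes that were not displayed.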


The following shows the results after searching for “machine learning”:

As shown, COURSES found 2 results in 31 milliseconds: one course is from Edx[8] and the other is from Udacity[5]. We can also see that the course title, URL, price, course length, course workload, course language, course category, instructor, and other information about the course are all correctly displayed (with default field values such as “Free” and “Undefined”).

I then applied faceted filters on the left, such as the following:


The filters, “Dave Holtz” and “$150/month,” were successfully applied and the search result was

correctly reduced to one result.
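Behind the user interface, a faceted search like this one corresponds to a Solr request that combines the query with one filter query per selected facet value. The sketch below shows how such a request URL could be assembled. The parameters q, fq, facet, facet.field, and wt are standard Solr request parameters, but the buildSearchUrl function, the base URL, and the field names “instructor” and “price” are assumptions made for illustration.

```javascript
// Build a Solr select URL with faceting enabled and one fq per active filter.
// Values containing double quotes would need extra escaping, which is omitted here.
function buildSearchUrl(baseUrl, query, filters, facetFields) {
  const params = new URLSearchParams();
  params.append("q", query);
  params.append("wt", "json");
  params.append("facet", "true");
  for (const field of facetFields) {
    params.append("facet.field", field);
  }
  // Each selected facet value becomes its own fq, so the filters combine with AND.
  for (const [field, value] of Object.entries(filters)) {
    params.append("fq", field + ':"' + value + '"');
  }
  return baseUrl + "?" + params.toString();
}

// Hypothetical usage mirroring the example above.
const exampleUrl = buildSearchUrl(
  "http://localhost:8983/solr/collection1/select", // assumed local Solr core
  "machine learning",
  { instructor: "Dave Holtz", price: "$150/month" },
  ["instructor", "price"]
);
```

Because each filter is sent as a separate fq parameter, adding a filter can only narrow the result set, which matches the behavior observed in the experiment.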

2. User Opinion

I demoed the project to one of our TAs. He suggested that I improve the user interface of the search engine. Another issue he mentioned is that the highlighting within the search results doesn’t seem to work well.

3. More Improvements

1. Implement keyword highlighting feature

2. User interface tweaking


Conclusions

In creating COURSES, a single-source search engine for online courses, I learned how to parse and collect data from multiple sources and how to structure that data to present it well. I also learned how to design a better UI through this experience. COURSES has potential, since it simplifies the process of searching for a suitable course by combining course sources and providing filters for more precise searching.

Future Work

In the future, I will try to implement the entire search engine without Solr, which will give me more control over data processing, result ranking, and so on. Without Solr, I can also freely adjust user interface elements as needed. Another important improvement would be automatic URL discovery and retrieval. There are probably hundreds or thousands of new courses created every day. Being able to monitor them and automatically retrieve updates from their URLs is very important for this kind of MOOC search engine.


References

[1] RedHoop. https://www.redhoop.com

[2] MOOC-List. www.mooc-list.com

[3] Class-Central. https://www.class-central.com

[4] Coursera. https://www.coursera.org

[5] Udacity. https://www.udacity.com

[6] Khan Academy. https://www.khanacademy.org

[7] Udemy. https://www.udemy.com

[8] Edx. https://www.edx.org

[9] Sanjeev Mishra. “HowTo? Solr: building a custom search engine”. http://pixedin.blogspot.com/2012/05/howto-solr-building-custom-search.html

[10] Apache Solr Reference Guide. https://cwiki.apache.org/confluence/display/solr/Velocity+Search+UI

[11] Solr Tutorial. https://lucene.apache.org/solr/4_8_0/tutorial.html