data science at udemy

Ankara Big Data Meetup - Bilkent Cyberpark, August 5, 2015

Data Science at Udemy

Larry Wai

Principal Data Scientist @Udemy


Overview of talk

● What is data science?
● Udemy in a nutshell
● Data science projects at Udemy
● Data science work cycle
● What does it mean to be a data scientist?


What is data science?

Data science in consumer internet = application of the scientific method, using big data computational methods, to ascertain, predict, and utilize user behavior for business purposes

Inherits from three historical schools of thought

1. Research of natural phenomena using the scientific method
○ e.g. physics, astronomy
○ data science arises from substituting the study of natural phenomena with the study of user behavior

2. Research of computational methods
○ e.g. mathematics, computer science
○ data science arises from pushing the limits of existing methods to compute that which could not be computed before

3. Research of human behavior
○ e.g. economics, psychology
○ data science arises from applying big data to the study of microscopic human behavior, i.e. millions of users x thousands of items = billions of user-item calculations

Other definitions (too general IMO):

● data science > statistics (only); stats does not require engineering skills
● data science > computer science (only); engineering does not require training in the scientific method
● data science > business analytics (only); analytics does not require engineering skills nor training in the scientific method


Udemy in a nutshell

● consumer online education marketplace
● instructors get 50% of enrollment fee
● no certification requirements
● typical enrollment price point (paid) is $20-$40
● get to critical mass (instructors and students) in each language through marketing
● above critical mass, leverage marketplace (organic) driven growth
● Udemy currently has ~7 million students, ~30 thousand courses
● relevance of search and recommendations is key to fostering growth
● learning goal data science is key to fostering long term growth

Google search trends for selected online education companies

● Udemy (blue). Exponential marketplace growth.
● Coursera (yellow), Udacity (red), Lynda (green). Incremental growth.
● note: this chart convinced me to join Udemy :)


Udemy web site


Data science projects at Udemy

search & recommendation
● real time recommendation (web, mobile)
● real time search
● batch e-mail recommendation

learning goals
● course learning process optimization
● learning goal paths
● career learning goals

+ more projects


Search and recommendation (in experiment)

Feature classes
● course historical averages
● personal historical behavior
● search term matching

Overall ranking strategy
● compute global score per visitor per course per day
● consider modules as filters on the total available inventory
● the module score will be the sum of the global course scores for the top N courses in the module
● individual courses are ranked within each module according to the global course score

[Figure: courses 1 to 4 laid out across modules A, B, and C]
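The ranking strategy above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `rank_modules`, the example scores, and the module memberships are all hypothetical, and ordering the modules themselves by module score is an assumption (the slide only says how the module score is computed).

```python
from typing import Dict, List, Tuple


def rank_modules(
    course_scores: Dict[str, float],
    modules: Dict[str, List[str]],
    top_n: int = 3,
) -> List[Tuple[str, float, List[str]]]:
    """Return (module, module score, ranked courses), best module first."""
    ranked = []
    for name, course_ids in modules.items():
        # A module is a filter on the total inventory of scored courses.
        members = [c for c in course_ids if c in course_scores]
        # Courses are ranked within the module by their global score.
        members.sort(key=lambda c: course_scores[c], reverse=True)
        # Module score = sum of global scores of the top N courses.
        module_score = sum(course_scores[c] for c in members[:top_n])
        ranked.append((name, module_score, members))
    # Assumption: modules are shown in descending order of module score.
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked


# Hypothetical global scores for one visitor on one day.
scores = {"course 1": 0.9, "course 2": 0.5, "course 3": 0.4, "course 4": 0.1}
modules = {
    "module A": ["course 1", "course 4"],
    "module B": ["course 2", "course 3", "course 4"],
    "module C": ["course 1", "course 2", "course 3", "course 4"],
}
result = rank_modules(scores, modules, top_n=2)
for name, module_score, courses in result:
    print(name, round(module_score, 2), courses)
```

Because every module reuses the same global course score, adding a new module only requires a new filter, not a new model.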


Learning goals (conceptual stage)

Course learning goal clustering
● goals are hierarchical
● goals are linked
● goals are dynamic

Overall learning goal strategy
● continuously update learning goal clustering
● quantify and evaluate student progress towards learning goals
● identify learning goal paths according to desired careers or hobbies

[Figure: learning goals 1 to 6 linked to courses A and B]


Data science work cycle

[Figure: cycle of five stages: exploratory analysis, model building, model deployment, experiment setup, data collection]

ideal cycling time is ~days to ~weeks


Exploratory analysis

● data to be explored can in general be defined as a multi-dimensional cube, a.k.a. “hypercube”, where each side of the hypercube is an exploratory “dimension” and the “measures” of the user behavior are aggregates in each cell

● the hypercube is the minimal representation required for the exploratory analysis; e.g. we minimize cardinality for continuous variables

● the human mind is unable to easily comprehend more than 3 dimensions, therefore exploratory analysis must be broken down into actions which project the entire hypercube onto different dimensions in sequence

● goal for the analyst is to understand the multi-dimensional user behavior, which may take many projections in sequence (~100)
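A minimal sketch of the hypercube idea, assuming a handful of made-up rows. The dimension names (`country`, `device`, `price_band`), the measures, and the helper names are hypothetical. It shows three things: bucketing a continuous variable to keep cardinality low, aggregating measures per cell, and projecting the cube onto fewer dimensions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Cell = Tuple[str, ...]
Cube = Dict[Cell, Dict[str, int]]

# Hypothetical granular rows: raw dimensions plus additive measures.
ROWS = [
    {"country": "TR", "device": "web", "price": 19.0, "sessions": 100, "enrollments": 10},
    {"country": "TR", "device": "mobile", "price": 29.0, "sessions": 50, "enrollments": 4},
    {"country": "US", "device": "web", "price": 19.0, "sessions": 200, "enrollments": 30},
    {"country": "US", "device": "web", "price": 39.0, "sessions": 80, "enrollments": 6},
    {"country": "US", "device": "mobile", "price": 19.0, "sessions": 70, "enrollments": 7},
]
DIMS = ["country", "device", "price_band"]
MEASURES = ["sessions", "enrollments"]


def price_band(price: float) -> str:
    """Bucket a continuous variable to minimize its cardinality."""
    return "<$25" if price < 25 else ">=$25"


def build_hypercube(rows: List[dict], dims: List[str], measures: List[str]) -> Cube:
    """Aggregate the measures into one cell per combination of dimensions."""
    cube: Cube = defaultdict(lambda: dict.fromkeys(measures, 0))
    for row in rows:
        cell = tuple(row[d] for d in dims)
        for m in measures:
            cube[cell][m] += row[m]
    return dict(cube)


def project(cube: Cube, dims: List[str], keep: List[str], measures: List[str]) -> Cube:
    """Project the cube onto a subset of dimensions by summing out the rest."""
    idx = [dims.index(d) for d in keep]
    out: Cube = defaultdict(lambda: dict.fromkeys(measures, 0))
    for cell, agg in cube.items():
        key = tuple(cell[i] for i in idx)
        for m in measures:
            out[key][m] += agg[m]
    return dict(out)


rows = [{**r, "price_band": price_band(r["price"])} for r in ROWS]
cube = build_hypercube(rows, DIMS, MEASURES)
by_country = project(cube, DIMS, ["country"], MEASURES)
by_device_price = project(cube, DIMS, ["device", "price_band"], MEASURES)
print(by_country)
```

Each call to `project` corresponds to one of the many projections an analyst steps through in sequence. Ratios such as conversion rate should be computed after projection, from the summed measures.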


model building

● platforms such as R allow us to leverage open source modeling packages and compare models with relatively low overhead

● most user behavior features are non-linear and correlated; thus, the simplest “black box” non-linear models which handle correlations are practical to use, e.g. decision trees

● use residuals on holdout to validate model
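To make the holdout check concrete, here is a small sketch in plain Python rather than R. It fits a one-split regression tree (a "stump", the simplest decision tree) on synthetic data and inspects residuals on a holdout set. The data, the function name `fit_stump`, and the 150/50 split are all assumptions for illustration, not the models used at Udemy.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold, one mean per side."""
    best = None
    # Skipping the smallest value keeps both sides of every split non-empty.
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


# Synthetic, non-linear relationship: a step at x = 5 plus small noise.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [(0.05 if x < 5 else 0.20) + random.gauss(0, 0.01) for x in xs]

# Train on the first 150 points, hold out the remaining 50.
train_x, train_y = xs[:150], ys[:150]
hold_x, hold_y = xs[150:], ys[150:]

model = fit_stump(train_x, train_y)

# Residual = observed minus predicted, on data the model has not seen.
residuals = [y - model(x) for x, y in zip(hold_x, hold_y)]
mean_residual = mean(residuals)
rmse = mean(r * r for r in residuals) ** 0.5
print(f"holdout mean residual={mean_residual:.4f}, rmse={rmse:.4f}")
```

A mean residual near zero and an RMSE close to the noise level suggest the model generalizes. Large or systematically signed residuals on the holdout point to bias or overfitting.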



model deployment

● standardized predictive model markup language (PMML) allows abstraction of models in deployment
● “plug-in” model deployment is agile because no new production code is needed for model updates
● shifts focus of algo development from production code development to data mining methods
● this approach allows a single person to build and deploy models quickly
● this approach is cutting edge and is being tested now at Udemy

[Figure: model pipeline in four stages]

model building
● create training dataset
● create predictive model, e.g. decision trees, random forest
● offline analysis: residuals, feature importance

model storage
● predictive model store (PMML format)

model deployment
● in memory model; load on initialization; periodic updates

model scoring
● loop through courses, compute feature vector per course
● compute score per course
● sort by score
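The deployment and scoring stages can be sketched as below. This is a hedged illustration: `ModelStore`, `load_model`, `featurize`, and the toy scoring formula are hypothetical. In a real setup `load_model` would parse a PMML document with a PMML evaluator; here it returns a plain Python function so the example stays self-contained.

```python
import time
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]
Model = Callable[[Features], float]


class ModelStore:
    """Keeps the current model in memory and reloads it periodically."""

    def __init__(self, loader: Callable[[], Model], refresh_seconds: float = 3600.0):
        self._loader = loader
        self._refresh_seconds = refresh_seconds
        self._model = loader()  # load on initialization
        self._loaded_at = time.monotonic()

    def get(self) -> Model:
        # Periodic update: swap in a fresh model, no production code change.
        if time.monotonic() - self._loaded_at >= self._refresh_seconds:
            self._model = self._loader()
            self._loaded_at = time.monotonic()
        return self._model


def load_model() -> Model:
    """Hypothetical stand-in for reading a model from the PMML store."""
    return lambda f: 0.7 * f["avg_rating"] / 5.0 + 0.3 * f["topic_match"]


def featurize(visitor: dict, course: dict) -> Features:
    """Compute the feature vector for one visitor-course pair."""
    return {
        "avg_rating": course["avg_rating"],
        "topic_match": 1.0 if course["topic"] == visitor["topic"] else 0.0,
    }


def rank_courses(store: ModelStore, visitor: dict, courses: Dict[str, dict]) -> List[Tuple[str, float]]:
    model = store.get()
    # Loop through courses, compute feature vector and score per course.
    scored = [(cid, model(featurize(visitor, c))) for cid, c in courses.items()]
    # Sort by score, best first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


store = ModelStore(load_model)
courses = {
    "c1": {"avg_rating": 4.5, "topic": "python"},
    "c2": {"avg_rating": 4.8, "topic": "design"},
    "c3": {"avg_rating": 3.0, "topic": "python"},
}
ranking = rank_courses(store, {"topic": "python"}, courses)
print(ranking)
```

The scoring code only depends on the `Model` callable, which is the point of the "plug-in" approach: replacing the stored model changes the ranking without touching `rank_courses`.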


experiment setup

Practical requirements for experiments, a.k.a. A/B tests

● need enough users to measure an interesting effect

● conversely, if an effect is not large enough to measure, then it is not interesting, at least from a data science point of view, and potentially from a business point of view

● e.g. an interesting effect from a business point of view would be +5% relative lift of conversion rate

● to detect a +5% relative lift at 95% confidence level (on, say, a typical 1 conversion per 10 sessions), need roughly 30,000 sessions in each of the A and B samples, i.e. >60,000 sessions in total

● ideally, would like to measure lift within ~days; so need >60,000 sessions per day

● Udemy currently has >200,000 sessions per day (but 2 years ago it was more like 20,000 sessions per day, so 10x slower to run experiments)
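The ~30,000 sessions per sample can be checked with a back-of-the-envelope calculation. The sketch below assumes a two-sided z-test with the normal approximation and no explicit power requirement, i.e. the sample size at which a true +5% lift sits exactly at the 95% significance boundary. The function name is hypothetical, and the slide does not say which formula was used.

```python
import math
from statistics import NormalDist


def sessions_per_arm(baseline_rate: float, relative_lift: float, confidence: float = 0.95) -> int:
    """Sessions needed in each of A and B for the lift to reach significance."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Absolute difference in conversion rate that we want to resolve.
    delta = baseline_rate * relative_lift
    # Per-session variance of a conversion (Bernoulli) at the baseline rate.
    variance = baseline_rate * (1 - baseline_rate)
    # Standard error of the difference is sqrt(2 * variance / n);
    # solve z * SE = delta for n.
    return math.ceil(2 * variance * (z / delta) ** 2)


n = sessions_per_arm(baseline_rate=0.10, relative_lift=0.05)
print(n, "sessions per arm,", 2 * n, "in total")
```

With a 10% baseline and a 5% relative lift this comes out at roughly 28,000 sessions per arm, in line with the 30,000 quoted above. Requiring a given statistical power on top of the confidence level would raise the number further.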

1. smoke test (~few days)
○ 1% for test variant(s)
○ verify that nothing is broken
○ 40% CONTROL_1, 40% CONTROL_2
○ validate that control is set up correctly

2. initial ramp (~1 week)
○ 5-10% for test variant(s)
○ sizing depends upon whether we’ve tested something like this before, and any revenue concerns

3. intermediate ramp (~few weeks)
○ 25%-50% for test variant
○ 40%-50% for CONTROL_1

4. final ramp / launch
○ 90% for test variant
○ 10% for CONTROL_1 (optional); turn off after a few weeks of monitoring
○ rename “test” as new baseline
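One common way to implement such a ramp is deterministic hashing of the visitor id into 100 buckets. The sketch below is an assumption, not a description of Udemy's system: the function name, the experiment key, and the `UNASSIGNED` label for traffic outside the listed percentages are all hypothetical. It uses the smoke-test allocation from step 1.

```python
import hashlib
from collections import Counter
from typing import List, Tuple

# Smoke-test allocation from the slide, in percent of traffic.
SMOKE_TEST: List[Tuple[str, int]] = [("TEST", 1), ("CONTROL_1", 40), ("CONTROL_2", 40)]


def assign_variant(visitor_id: str, experiment: str, allocation: List[Tuple[str, int]]) -> str:
    """Map a visitor to a variant, stable across sessions."""
    # Hashing experiment + visitor gives a stable, roughly uniform bucket 0..99.
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    upper = 0
    for name, percent in allocation:
        upper += percent
        if bucket < upper:
            return name
    # Assumption: traffic beyond the listed percentages is not in the experiment.
    return "UNASSIGNED"


counts = Counter(
    assign_variant(f"visitor-{i}", "new_ranking", SMOKE_TEST) for i in range(20000)
)
print(counts)
```

Comparing conversion between CONTROL_1 and CONTROL_2 is an A/A check: since both receive the same experience, a significant difference between them signals a setup problem rather than a real effect.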


data collection

● data should be collected at the most granular level, e.g. typically per visitor per item per day
● data should be pre-arranged in a way which facilitates fast hypercube production, i.e. star schema
● most granular data is located at the star core
● experiment variants can be incorporated as an additional dimension in one of the star limbs

[Figure: “star schema” (with intermediate mapping). A core table with grouping fields A, B, C is surrounded by limb tables with grouping field A, grouping fields A, B, and grouping field B; further tables labeled B, C and A, B, D; and a mapping table with grouping field C and other field D]
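A small star schema can be tried out with SQLite from the Python standard library. Everything here is made up for illustration: the table names, columns, and rows are hypothetical, and putting the experiment variant in the visitor limb assumes that assignment is per visitor. The query joins the core to its limbs to produce a hypercube with dimensions (variant, category).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Star core: most granular data, per visitor per course per day.
    CREATE TABLE core (
        visitor_id INTEGER, course_id INTEGER, day TEXT,
        views INTEGER, enrollments INTEGER
    );
    -- Limb tables: one row per key, carrying exploratory dimensions.
    CREATE TABLE visitor_limb (visitor_id INTEGER PRIMARY KEY, country TEXT, variant TEXT);
    CREATE TABLE course_limb (course_id INTEGER PRIMARY KEY, category TEXT);
    """
)
conn.executemany(
    "INSERT INTO core VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, "2015-08-01", 3, 1),
        (1, 11, "2015-08-01", 2, 0),
        (2, 10, "2015-08-01", 5, 0),
        (2, 11, "2015-08-02", 1, 1),
        (3, 10, "2015-08-02", 4, 1),
    ],
)
conn.executemany(
    "INSERT INTO visitor_limb VALUES (?, ?, ?)",
    [(1, "TR", "CONTROL_1"), (2, "TR", "TEST"), (3, "US", "CONTROL_1")],
)
conn.executemany(
    "INSERT INTO course_limb VALUES (?, ?)",
    [(10, "programming"), (11, "design")],
)

# Hypercube with dimensions (variant, category) and two summed measures.
rows = conn.execute(
    """
    SELECT v.variant, c.category, SUM(f.views), SUM(f.enrollments)
    FROM core AS f
    JOIN visitor_limb AS v ON v.visitor_id = f.visitor_id
    JOIN course_limb AS c ON c.course_id = f.course_id
    GROUP BY v.variant, c.category
    ORDER BY v.variant, c.category
    """
).fetchall()
for row in rows:
    print(row)
```

Since the variant is just another limb column, an A/B comparison uses the same join-and-group query as any other exploratory projection.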


What does it mean to be a data scientist?

A successful data scientist is somebody who can independently execute the entire data science work cycle on the time scale of days to weeks.

Important personal factors
● technical chops in math, computational methods, and the scientific method
● a genuine research interest in the underlying user behavior
● good intuition for how the business works

Important environmental factors
● top-down knowledgeability and commitment to data science
● excellent data architect
● best practices data science infrastructure


Udemy is hiring!

https://about.udemy.com/careers/
