Kaggle Talk Series: Top 0.2% Kaggler on the Amazon Employee Access Challenge

Amazon Employee Access Challenge

Predict an employee's access needs, given his/her job role

Yibo Chen
Data Scientist @ Supstat Inc

Amazon Employee Access Challenge http://nycopendata.com/KagglerTalk1/index.html

1 of 65 6/13/14, 2:01 PM

Agenda

1. Introduction to the Challenge
2. Look into the Data
3. Model Building
4. Summary

Introduction to the Challenge

the story

http://www.kaggle.com/c/amazon-employee-access-challenge

It is all about the access we need to do our daily work.

Introduction to the Challenge

the mission

Build an auto-access model based on the historical data, to determine the access privilege according to the employee's job role and the resource he or she applied for.

Introduction to the Challenge

the data

The data consists of real historical data collected from 2010 & 2011. Employees are manually allowed or denied access to resources over time.

the files

· train.csv - The training set. Each row has the ACTION (ground truth), RESOURCE, and information about the employee's role at the time of approval.
· test.csv - The test set for which predictions should be made. Each row asks whether an employee having the listed characteristics should have access to the listed resource.

Introduction to the Challenge

the variables

COLUMN NAME       DESCRIPTION
ACTION            ACTION is 1 if the resource was approved, 0 if the resource was not
RESOURCE          An ID for each resource
MGR_ID            The EMPLOYEE ID of the manager of the current EMPLOYEE ID record
ROLE_ROLLUP_1     Company role grouping category id 1 (e.g. US Engineering)
ROLE_ROLLUP_2     Company role grouping category id 2 (e.g. US Retail)
ROLE_DEPTNAME     Company role department description (e.g. Retail)
ROLE_TITLE        Company role business title description (e.g. Senior Engineering Retail Manager)
ROLE_FAMILY_DESC  Company role family extended description (e.g. Retail Manager, Software Engineering)
ROLE_FAMILY       Company role family description (e.g. Retail Manager)
ROLE_CODE         Company role code; this code is unique to each role (e.g. Manager)

Introduction to the Challenge

the metric

AUC (area under the ROC curve)

· is a metric used to judge predictions in binary response (0/1) problems
· is only sensitive to the order determined by the predictions, not to their magnitudes
· can be computed in R with the package verification or ROCR
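The order-only property can be checked without any package. The sketch below computes AUC from ranks (the Mann-Whitney identity) and shows that a strictly increasing transform of the scores leaves it unchanged. The helper name auc_rank is made up for this illustration and is not part of verification or ROCR.

```r
# AUC via the rank-sum (Mann-Whitney) identity:
#   AUC = (sum of ranks of the positives - n1 * (n1 + 1) / 2) / (n1 * n0)
# auc_rank is a hypothetical helper name used only for this illustration.
auc_rank <- function(label, score) {
  r  <- rank(score)            # average ranks, so a tie counts as 1/2
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

label <- c(0, 0, 0, 0, 1, 1, 1, 1)
score <- c(1, 2, 3, 6, 5, 4, 7, 8)

auc_raw <- auc_rank(label, score)
# exp() changes the magnitudes but not the order, so the AUC is unchanged.
auc_exp <- auc_rank(label, exp(score))
print(c(raw = auc_raw, transformed = auc_exp))
```

Both values are 0.875: 14 of the 16 (positive, negative) pairs are ordered correctly.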

Introduction to the Challenge

the metric

(t <- data.frame(true_label = c(0, 0, 0, 0, 1, 1, 1, 1),
                 predict_1  = c(1, 2, 3, 4, 5, 6, 7, 8),
                 predict_2  = c(1, 2, 3, 6, 5, 4, 7, 8),
                 predict_3  = c(1, 7, 6, 4, 5, 3, 2, 8)))
##   true_label predict_1 predict_2 predict_3
## 1          0         1         1         1
## 2          0         2         2         7
## 3          0         3         3         6
## 4          0         4         6         4
## 5          1         5         5         5
## 6          1         6         4         3
## 7          1         7         7         2
## 8          1         8         8         8

table(t$predict_2 >= 6, t$true_label)
##
##         0 1
##   FALSE 3 2
##   TRUE  1 2

P: 4, N: 4, TP: 2, FP: 1
TPR = TP/P = 0.5
FPR = FP/N = 0.25

table(t$predict_2 >= 5, t$true_label)
##
##         0 1
##   FALSE 3 1
##   TRUE  1 3

P: 4, N: 4, TP: 3, FP: 1
TPR = TP/P = 0.75
FPR = FP/N = 0.25
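Repeating this calculation for every threshold traces out the whole ROC curve, and the area under it is the AUC. The following is a minimal base-R sketch of that idea for predict_2; the helper name roc_points is made up for this illustration.

```r
# One (FPR, TPR) point per distinct threshold, then the trapezoid rule.
# roc_points is a hypothetical helper name used only for this illustration.
roc_points <- function(label, score) {
  thr <- c(Inf, sort(unique(score), decreasing = TRUE))
  P <- sum(label == 1)
  N <- sum(label == 0)
  data.frame(
    threshold = thr,
    fpr = sapply(thr, function(s) sum(score >= s & label == 0) / N),
    tpr = sapply(thr, function(s) sum(score >= s & label == 1) / P)
  )
}

label <- c(0, 0, 0, 0, 1, 1, 1, 1)
score <- c(1, 2, 3, 6, 5, 4, 7, 8)   # predict_2 from the slides

roc <- roc_points(label, score)
print(roc)

# Area under the piecewise-linear ROC curve.
auc <- sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
print(auc)
```

The rows for thresholds 6 and 5 reproduce the two confusion tables above, and the area comes out as 0.875.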



# AUC of predict_1, computed with two different packages
require(ROCR, quietly = TRUE)
pred <- prediction(t$predict_1, t$true_label)
performance(pred, "auc")@y.values[[1]]
## [1] 1
require(verification, quietly = TRUE)
roc.area(t$true_label, t$predict_1)$A
## [1] 1

# ROC curve of predict_1
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = 2, lwd = 3)

The same two calls on predict_2 give:
## [1] 0.875
## [1] 0.875




The same two calls on predict_3 give:
## [1] 0.5
## [1] 0.5




Look into the Data

load data from files
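The loading code for this slide is not preserved. The following is only a sketch of one way to do it, assuming train.csv and test.csv (the file names listed earlier) sit in the working directory. The lower-case column names and the stacking of train on top of test are inferred from the outputs on the later slides; the helper name load_access_data and the handling of an id column in test.csv are assumptions.

```r
# Hypothetical helper: read both files, lower-case the column names, and stack
# train on top of test so the features can be processed together.
load_access_data <- function(train_path = "train.csv", test_path = "test.csv") {
  train <- read.csv(train_path)
  test  <- read.csv(test_path)
  names(train) <- tolower(names(train))
  names(test)  <- tolower(names(test))

  # Predictors = columns present in both files, minus the target
  # (and minus an id column, if test.csv has one).
  features <- setdiff(intersect(names(train), names(test)), c("action", "id"))

  x <- rbind(train[, features, drop = FALSE], test[, features, drop = FALSE])
  y <- c(train$action, rep(NA, nrow(test)))   # target is unknown for test rows
  list(x = x, y = y, n_train = nrow(train))
}

# Usage, only if the two CSV files are actually present:
if (file.exists("train.csv") && file.exists("test.csv")) {
  d <- load_access_data()
  x <- d$x
  y <- d$y
  print(table(y, useNA = "ifany"))
}
```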

Look into the Data

the target

table(y, useNA = "ifany")
## y
##     0     1  <NA>
##  1897 30872 58921

(The 58921 <NA> rows are the test set, whose ACTION is unknown.)

Look into the Data

the predictor

Look into the Data

treat the features as Categorical or Numerical?

sapply(x, function(z) {
  length(unique(z))
})
##         resource           mgr_id    role_rollup_1    role_rollup_2
##             7518             4913              130              183
##    role_deptname       role_title role_family_desc      role_family
##              476              361             2951               68
##        role_code
##              361

Look into the Data

par(mar = c(5, 4, 0, 2))
plot(x$role_title, x$role_code)

Look into the Data

length(unique(x$role_title))

## [1] 361

length(unique(x$role_code))

## [1] 361

length(unique(paste(x$role_code, x$role_title)))

## [1] 361

Look into the Data

# role_title and role_code map one-to-one, so role_code adds nothing: drop it
x <- x[, names(x) != "role_code"]
sapply(x, function(z) {
  length(unique(z))
})
##         resource           mgr_id    role_rollup_1    role_rollup_2
##             7518             4913              130              183
##    role_deptname       role_title role_family_desc      role_family
##              476              361             2951               68

Look into the Data

check the distribution - role_family_desc

hist(train$role_family_desc, breaks = 100)
hist(test$role_family_desc, breaks = 100)

Look into the Data

check the distribution - resource

hist(train$resource, breaks = 100)
hist(test$resource, breaks = 100)

Look into the Data

check the distribution - mgr_id

hist(train$mgr_id, breaks = 100)
hist(test$mgr_id, breaks = 100)

Look into the Data

treat the features as Categorical or Numerical?

YetiMan shared his findings in the forum:

· 1) My analyses so far leads me to believe that there is "information" in some of the categorical labels themselves. My hunch is that they imply some sort of chronology, but I can't be certain.
· 2) Just for fun I increased the max classes for R's gbm package to 8192 and built a model (using plain vanilla training data). The leader board result was 0.87 - slightly worse than the all-numeric gbm. Food for thought.

Look into the Data

our approach

1. treat all features as Categorical
2. treat all features as Numerical
3. treat mgr_id as Numerical, the others as Categorical

Model Building

workflow

· Feature Extraction
· Base Learners
· Ensemble

Model Building

Feature Extraction

1. the raw features (as numerical)
2. the raw features (as categorical) with level reduction
3. the dummies (in sparse Matrix)
4. the dummies including the interaction
5. some derived variables (count & ratio)

Model Building

1. the raw features (as numerical)

Model Building

2. the raw features (as categorical) with level reduction

2.1 choose the top frequency categories

VAR_RAW  FREQUENCY  VAR_WITH_LEVEL_REDUCTION
a        3          a
a        3          a
a        3          a
b        2          b
b        2          b
c        1          other
d        1          other

# Keep the 2 most frequent levels of each column; pool the rest as "other".
for (i in seq_len(ncol(x))) {
  x[, i] <- as.character(x[, i])   # so "other" is a valid value even for factors
  the_labels <- names(sort(table(x[, i]), decreasing = TRUE))[1:2]
  x[!x[, i] %in% the_labels, i] <- "other"
}

Model Building

2. the raw features (as categorical) with level reduction

2.2 use Pearson's Chi-squared Test

table(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))
##
##     mgr_770 mgr_not_770
##   0       5        1892
##   1     147       30725

chisq.test(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))$p.value
## [1] 0.2507
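The slides show the test for a single level only. One plausible way to turn it into a level-reduction rule (an assumption, not something the talk spells out) is to test every level against the rest and pool the levels whose p-value is above a cut-off into "other". The helper name reduce_levels_chisq and the cut-off of 0.05 are illustrative choices.

```r
# Hypothetical helper: keep a level only if "is this level" vs. the target
# gives a chi-squared p-value below `alpha`; pool all other levels as "other".
reduce_levels_chisq <- function(v, target, alpha = 0.05) {
  v <- as.character(v)
  p_values <- sapply(unique(v), function(lv) {
    # Small cells make chisq.test() warn about the approximation; the
    # p-value is still returned, so the warning is silenced here.
    suppressWarnings(chisq.test(target, v == lv)$p.value)
  })
  keep <- names(p_values)[p_values < alpha]
  ifelse(v %in% keep, v, "other")
}

# Toy data: "a" is mostly denied, "b" is mostly approved,
# "c" and "d" sit close to the overall approval rate.
v      <- rep(c("a", "b", "c", "d"), times = c(40, 40, 10, 10))
target <- c(rep(c(0, 1), times = c(35, 5)),    # a
            rep(c(0, 1), times = c(4, 36)),    # b
            rep(c(0, 1), times = c(4, 6)),     # c
            rep(c(0, 1), times = c(4, 6)))     # d

reduced <- reduce_levels_chisq(v, target)
print(table(original = v, reduced = reduced))
```

On this toy data the levels a and b survive, while c and d are merged into "other". Under such a rule, mgr_id 770 (p-value 0.2507) would also be pooled.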

Model Building

3. the dummies (in sparse Matrix)

ID  VAR  VAR_A  VAR_B  VAR_C
1   a    1      0      0
2   a    1      0      0
3   a    1      0      0
4   b    0      1      0
5   c    0      0      1

Model Building

3. the dummies (in sparse Matrix)

use package Matrix to create the dummies

require(Matrix)
set.seed(114)
# sample() normalizes prob, so zeros are drawn with probability 6/7
Matrix(sample(c(0, 1), 40, replace = TRUE, prob = c(0.6, 0.1)), nrow = 5)
## 5 x 8 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . 1 . . . 1
## [2,] . 1 . . . . 1 .
## [3,] 1 . . . . . . .
## [4,] . . . . . 1 . .
## [5,] . . . . . . . .
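The example above only shows what a sparse matrix looks like. As a sketch of how the dummies themselves could be built, the block below turns the VAR column from the previous table into a sparse indicator matrix with sparseMatrix() from the Matrix package. The variable names and the VAR_ prefix are chosen to match that table; this is not the code used in the talk.

```r
library(Matrix)

# One row per observation, one column per level, a single 1 per row.
var <- factor(c("a", "a", "a", "b", "c"))
dummies <- sparseMatrix(
  i = seq_along(var),          # row index: the observation
  j = as.integer(var),         # column index: the level code
  x = 1,
  dims = c(length(var), nlevels(var)),
  dimnames = list(NULL, paste0("VAR_", toupper(levels(var))))
)
print(dummies)
```

Only the non-zero cells are stored, which is what makes dummies for features with thousands of levels (resource, mgr_id) affordable.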

Model Building

4. the dummies including the interaction

ID  M  N  MN_AP  MN_AQ  MN_BP  MN_BQ
1   a  p  1      0      0      0
2   a  p  1      0      0      0
3   a  q  0      1      0      0
4   b  p  0      0      1      0
5   b  q  0      0      0      1
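The table above can be reproduced in a few lines of base R by pasting the two features into one combined feature and expanding that into dummies. This is a dense sketch for readability; the variable names are chosen to match the table, and real data would need the sparse representation.

```r
m <- c("a", "a", "a", "b", "b")
n <- c("p", "p", "q", "p", "q")

# Paste the two features into one combined feature, then expand it into dummies.
mn <- factor(paste0(m, n))
dummies <- sapply(levels(mn), function(lv) as.integer(mn == lv))
colnames(dummies) <- paste0("MN_", toupper(levels(mn)))
print(dummies)
```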

Model Building

5. some derived variables (count & ratio)

· the frequency of every category
· the frequency of the interactions
· the proportion

Model Building

5. some derived variables (count & ratio)

tmp1 <- cnt_1[114:117, c('c1_resource', 'c1_role_deptname')]
tmp2 <- cnt_2[114:117, c('c2_resource_role_deptname_cnt_ij',
                         'c2_resource_role_deptname_ratio_i',
                         'c2_resource_role_deptname_ratio_j')]
cbind(tmp1, tmp2)
##     c1_resource c1_role_deptname c2_resource_role_deptname_cnt_ij
## 114           1             1645                                1
## 115          36             1312                                4
## 116          45              465                               24
## 117         374             2377                              169
##     c2_resource_role_deptname_ratio_i c2_resource_role_deptname_ratio_j
## 114                            1.0000                         0.0006079
## 115                            0.1111                         0.0030488
## 116                            0.5333                         0.0516129
## 117                            0.4519                         0.0710980
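The code that builds cnt_1 and cnt_2 is not shown. The rows above are consistent with ratio_i = cnt_ij / c1_resource and ratio_j = cnt_ij / c1_role_deptname (for row 115: 4/36 = 0.1111 and 4/1312 = 0.0030488). Under that reading, a minimal base-R sketch on toy data could look as follows; the toy values and the shortened column names are made up for illustration.

```r
# Toy data with the two columns used in the example above.
x <- data.frame(
  resource      = c(1, 1, 1, 2, 2, 3),
  role_deptname = c(10, 10, 20, 10, 20, 20)
)

ones   <- rep(1, nrow(x))
c1_i   <- ave(ones, x$resource, FUN = sum)                    # rows per resource
c1_j   <- ave(ones, x$role_deptname, FUN = sum)               # rows per role_deptname
cnt_ij <- ave(ones, x$resource, x$role_deptname, FUN = sum)   # rows per pair

derived <- data.frame(
  c1_resource      = c1_i,
  c1_role_deptname = c1_j,
  cnt_ij           = cnt_ij,
  ratio_i          = cnt_ij / c1_i,   # share of the resource's rows with this deptname
  ratio_j          = cnt_ij / c1_j    # share of the deptname's rows with this resource
)
print(derived)
```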

Model Building

Base Learners

1. Regularized Generalized Linear Model
2. Support Vector Machine
3. Random Forest
4. Gradient Boosting Machine

Model Building

Ensemble

1. mean prediction of all models
2. two-stage stacking
   · based on 5-fold cv holdout predictions
   · algorithms in level-1 (Regularized Generalized Linear Model & Gradient Boosting Machine)
   · algorithms in level-2 (Regularized Generalized Linear Model)
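The two ensembling options can be sketched in base R on simulated data. To keep the example self-contained, plain glm() fits on different inputs stand in for the glmnet and gbm level-1 models, and a plain glm() stands in for the regularized level-2 model; these substitutions, the simulated data and all names are assumptions, not the talk's actual setup.

```r
set.seed(114)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(1.5 * x1 - x2))
d  <- data.frame(y, x1, x2)

# Level-1 learners: plain glm() on different inputs, standing in for the
# glmnet and gbm models of the talk (an assumption made to stay in base R).
learners <- list(
  m_x1 = function(train) glm(y ~ x1, family = binomial, data = train),
  m_x2 = function(train) glm(y ~ x2, family = binomial, data = train)
)

# 5-fold CV: each row is predicted by models that never saw it.
fold <- sample(rep(1:5, length.out = n))
oof  <- matrix(NA_real_, n, length(learners),
               dimnames = list(NULL, names(learners)))
for (k in 1:5) {
  held_out <- fold == k
  for (m in names(learners)) {
    fit <- learners[[m]](d[!held_out, ])
    oof[held_out, m] <- predict(fit, newdata = d[held_out, ], type = "response")
  }
}

# Ensemble 1: mean prediction of all models.
pred_mean <- rowMeans(oof)

# Ensemble 2: a level-2 glm trained on the holdout predictions.
level2     <- glm(y ~ ., family = binomial, data = data.frame(y = d$y, oof))
pred_stack <- fitted(level2)
```

Training level-2 on holdout predictions rather than in-sample fits keeps it from simply rewarding whichever level-1 model overfits most. For test rows, the level-1 models would be refit on the full training set before their predictions are passed to level2.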

Model Building

1. Regularized Generalized Linear Model

· generalized linear model (glm)
· convex penalties

· logistic regression

set.seed(114)   # seed before the first random draw, so x is reproducible too
x <- sort(rnorm(100))
y <- c(sample(x = c(0, 1), size = 30, prob = c(0.9, 0.1), replace = TRUE),
       sample(x = c(0, 1), size = 20, prob = c(0.7, 0.3), replace = TRUE),
       sample(x = c(0, 1), size = 20, prob = c(0.3, 0.7), replace = TRUE),
       sample(x = c(0, 1), size = 30, prob = c(0.1, 0.9), replace = TRUE))
m1 <- lm(y ~ x)                                     # straight line, for comparison
m2 <- glm(y ~ x, family = binomial(link = logit))   # logistic regression
y2 <- predict(m2, type = 'response')                # fitted probabilities
par(mar = c(5, 4, 0, 0))
plot(y ~ x); abline(m1, lwd = 3, col = 2)
points(x, y2, type = 'l', lwd = 3, col = 3)

· convex penalties
  - L1 (lasso)
  - L2 (ridge regression)
  - mixture of L1 & L2 (elastic net)
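For concreteness, the three penalties can be written as one function of a mixing parameter. The sketch below uses the parametrization of the glmnet package, where alpha = 1 gives the lasso and alpha = 0 gives ridge; the helper name elastic_net_penalty and the example coefficients are made up for illustration.

```r
# Penalty in glmnet's parametrization:
#   lambda * ( (1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)) )
# alpha = 1 is the lasso, alpha = 0 is ridge, values in between mix the two.
# elastic_net_penalty is a hypothetical helper name for this illustration.
elastic_net_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta  <- c(0.5, -2, 0)
lasso <- elastic_net_penalty(beta, lambda = 0.1, alpha = 1)     # 0.1 * 2.5
ridge <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0)     # 0.1 * 4.25 / 2
enet  <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0.5)
print(c(lasso = lasso, ridge = ridge, elastic_net = enet))
```

The penalty is added to the negative log-likelihood of the glm, so larger lambda shrinks the coefficients more strongly, and the L1 part can set some of them exactly to zero.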

· the dummies (in sparse Matrix)
· the dummies including the interaction
· R package: glmnet

Model Building

2. Support Vector Machine (just for diversity)

· the dummies including the interaction
· some derived variables (count & ratio)
· R package: kernlab, e1071

Model Building

decision tree

Model Building

3. Random Forest

decision trees + bagging

Model Building

3. Random Forest

· the raw features (as numerical)
· the raw features (as categorical) with level reduction
· R package: randomForest

Model Building

4. Gradient Boosting Machine

decision trees + boosting

Model Building

4. Gradient Boosting Machine

· the raw features (as numerical)
· the raw features (as categorical) with level reduction
· R package: gbm

Summary

some insights

VARIABLE NAME                                         REL.INF
cnt2_resource_role_deptname_cnt_ij                    2.542974017
cnt2_resource_role_rollup_2_ratio_i                   2.107624216
cnt2_resource_role_deptname_ratio_j                   2.017153645
cnt2_resource_role_rollup_2_ratio_j                   1.910465811
cnt2_resource_role_family_ratio_i                     1.770737494
...                                                   ...
cnt4_resource_mgr_id_role_rollup_2_role_family_desc   0.008938286
cnt4_resource_role_rollup_1_role_rollup_2_role_title  0.008930661
cnt4_resource_mgr_id_role_rollup_1_role_family_desc   0.002106958

Summary

some insights

summary(x[, c('cnt2_resource_role_deptname_cnt_ij',
              'cnt2_resource_role_deptname_ratio_j')])
##  cnt2_resource_role_deptname_cnt_ij cnt2_resource_role_deptname_ratio_j
##  Min.   :  1.0                      Min.   :0.0003
##  1st Qu.:  2.0                      1st Qu.:0.0061
##  Median :  7.0                      Median :0.0172
##  Mean   : 15.6                      Mean   :0.0315
##  3rd Qu.: 17.0                      3rd Qu.:0.0368
##  Max.   :201.0                      Max.   :1.0000

xx <- x[, 'cnt2_resource_role_deptname_cnt_ij']
tt <- t.test(xx ~ y)
list(estimate = tt$estimate,
     conf.int = tt$conf.int, p.value = tt$p.value)
## $estimate
## mean in group 0 mean in group 1
##           10.04           13.82
##
## $conf.int
## [1] -4.851 -2.710
## attr(,"conf.level")
## [1] 0.95
##
## $p.value
## [1] 5.838e-12

par(mar = c(5, 4, 2, 2))
boxplot(xx ~ y)

# bin the count, then look at the bin sizes
xxx <- cut(xx, include.lowest = TRUE,
           breaks = c(0, 1, 3, 7, 14, 30, 300))
par(mar = c(5, 2, 0, 0))
barplot(table(xxx))

# share of ACTION = 0 (denied) within each bin
tb <- table(y, xxx)
r_0 <- tb[1, ] / colSums(tb)
par(mar = c(5, 2, 0, 0))
plot(r_0, type = 'l', lwd = 3)

xx <- x[, 'cnt2_resource_role_deptname_ratio_j']
tt <- t.test(xx ~ y)
list(estimate = tt$estimate,
     conf.int = tt$conf.int, p.value = tt$p.value)
## $estimate
## mean in group 0 mean in group 1
##         0.01955         0.02902
##
## $conf.int
## [1] -0.011732 -0.007205
## attr(,"conf.level")
## [1] 0.95
##
## $p.value
## [1] 3.93e-16

par(mar = c(5, 4, 2, 2))
boxplot(xx ~ y)

# bin the ratio into quintiles, then look at the bin sizes
xxx <- cut(xx, include.lowest = TRUE,
           breaks = quantile(xx, seq(0, 1, 0.2)))
par(mar = c(5, 2, 0, 0))
barplot(table(xxx))

# share of ACTION = 0 (denied) within each bin
tb <- table(y, xxx)
r_0 <- tb[1, ] / colSums(tb)
par(mar = c(5, 2, 0, 0))
plot(r_0, type = 'l', lwd = 3)

Summary

overfitting

MODEL                            AUC_CV     AUC_PUBLIC  AUC_PRIVATE
num_glmnet_0                     0.8985069  0.87737     0.87385
stacking_gbm_with_the_glmnet     0.9277316  0.90695     0.90478
stacking_gbm_without_the_glmnet  0.9182303  0.91529     0.91130

Summary

overfitting

Winning solution code and methodology
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/5283/winning-solution-code-and-methodology

Summary

useful discussions

Python code to achieve 0.90 AUC with Logistic Regression
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4838/python-code-to-achieve-0-90-auc-with-logistic-regression

Starter code in python with scikit-learn (AUC .885)
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4797/starter-code-in-python-with-scikit-learn-auc-885

Patterns in Training data set
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4886/patterns-in-training-data-set

thank you
