Kaggle Talk Series: Top 0.2% Kaggler on Amazon Employee Access Challenge
DESCRIPTION
NYC Data Science Academy, NYC Open Data Meetup, Big Data, Data Science, NYC, Vivian Zhang, SupStat Inc, NYC, Machine learning, Kaggle, amazon employee access challenge
TRANSCRIPT
Amazon Employee Access Challenge
Predict an employee's access needs, given his/her job role
Yibo Chen
Data Scientist @ Supstat Inc
Amazon Employee Access Challenge http://nycopendata.com/KagglerTalk1/index.html
1 of 65 6/13/14, 2:01 PM
Agenda
1. Introduction to the Challenge
2. Look into the Data
3. Model Building
4. Summary
Introduction to the Challenge
the story
http://www.kaggle.com/c/amazon-employee-access-challenge
It is all about the access we need to fulfill our daily work.
Introduction to the Challenge
the mission
build an auto-access model based on the historical data
to determine the access privilege according to the employee's job role and the resource he applied for
Introduction to the Challenge
the data
The data consists of real historical data collected from 2010 & 2011. Employees are manually allowed or denied access to resources over time.
the files
· train.csv - The training set. Each row has the ACTION (ground truth), RESOURCE, and information about the employee's role at the time of approval.
· test.csv - The test set for which predictions should be made. Each row asks whether an employee having the listed characteristics should have access to the listed resource.
Introduction to the Challenge
the variables
COLUMN NAME DESCRIPTION
ACTION ACTION is 1 if the resource was approved, 0 if the resource was not
RESOURCE An ID for each resource
MGR_ID The EMPLOYEE ID of the manager of the current EMPLOYEE ID record
ROLE_ROLLUP_1 Company role grouping category id 1 (e.g. US Engineering)
ROLE_ROLLUP_2 Company role grouping category id 2 (e.g. US Retail)
ROLE_DEPTNAME Company role department description (e.g. Retail)
ROLE_TITLE Company role business title description (e.g. Senior Engineering Retail Manager)
ROLE_FAMILY_DESC Company role family extended description (e.g. Retail Manager, Software Engineering)
ROLE_FAMILY Company role family description (e.g. Retail Manager)
ROLE_CODE Company role code; this code is unique to each role (e.g. Manager)
Introduction to the Challenge
the metric
AUC (area under the ROC curve)
· is a metric used to judge predictions in binary response (0/1) problems
· is sensitive only to the order determined by the predictions, not to their magnitudes
· can be computed in R with the package verification or ROCR
Introduction to the Challenge
the metric
(t <- data.frame(true_label=c(0,0,0,0,1,1,1,1),
predict_1=c(1,2,3,4,5,6,7,8),
predict_2=c(1,2,3,6,5,4,7,8),
predict_3=c(1,7,6,4,5,3,2,8)))
## true_label predict_1 predict_2 predict_3
## 1 0 1 1 1
## 2 0 2 2 7
## 3 0 3 3 6
## 4 0 4 6 4
## 5 1 5 5 5
## 6 1 6 4 3
## 7 1 7 7 2
## 8 1 8 8 8
Introduction to the Challenge
the metric
P: 4, N: 4, TP: 2, FP: 1, TPR = TP/P = 0.5, FPR = FP/N = 0.25
table(t$predict_2 >= 6, t$true_label)
##
## 0 1
## FALSE 3 2
## TRUE 1 2
Introduction to the Challenge
the metric
P: 4, N: 4, TP: 3, FP: 1, TPR = TP/P = 0.75, FPR = FP/N = 0.25
table(t$predict_2 >= 5, t$true_label)
##
## 0 1
## FALSE 3 1
## TRUE 1 3
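The two threshold tables above are two points on the ROC curve. As a minimal sketch in base R, the same TPR and FPR can be recomputed for every distinct threshold of predict_2. The helper name roc_points is hypothetical and not part of the talk.

```r
true_label <- c(0, 0, 0, 0, 1, 1, 1, 1)
predict_2  <- c(1, 2, 3, 6, 5, 4, 7, 8)

# roc_points is a hypothetical helper: one (threshold, TPR, FPR) row per distinct score
roc_points <- function(score, label) {
  thresholds <- sort(unique(score), decreasing = TRUE)
  P <- sum(label == 1)  # number of positives
  N <- sum(label == 0)  # number of negatives
  data.frame(
    threshold = thresholds,
    tpr = sapply(thresholds, function(th) sum(score >= th & label == 1) / P),
    fpr = sapply(thresholds, function(th) sum(score >= th & label == 0) / N)
  )
}

roc <- roc_points(predict_2, true_label)
print(roc)
```

The rows for thresholds 6 and 5 reproduce the two slides above; plotting tpr against fpr traces the ROC curve.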
Introduction to the Challenge
the metric
require(ROCR, quietly = T)
pred <- prediction(t$predict_1, t$true_label)
performance(pred, "auc")@y.values[[1]]
## [1] 1
require(verification, quietly = T)
roc.area(t$true_label, t$predict_1)$A
## [1] 1
pred <- prediction(t$predict_1, t$true_label)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = 2, lwd = 3)
Introduction to the Challenge
the metric
pred <- prediction(t$predict_2, t$true_label)
performance(pred, "auc")@y.values[[1]]
## [1] 0.875
roc.area(t$true_label, t$predict_2)$A
## [1] 0.875
pred <- prediction(t$predict_2, t$true_label)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = 2, lwd = 3)
Introduction to the Challenge
the metric
pred <- prediction(t$predict_3, t$true_label)
performance(pred, "auc")@y.values[[1]]
## [1] 0.5
roc.area(t$true_label, t$predict_3)$A
## [1] 0.5
pred <- prediction(t$predict_3, t$true_label)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = 2, lwd = 3)
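The three AUC values above can also be read as a ranking probability: the share of (positive, negative) pairs in which the positive case gets the higher score. A minimal base R sketch of that view, without ROCR or verification; the helper name auc_by_pairs is hypothetical.

```r
true_label <- c(0, 0, 0, 0, 1, 1, 1, 1)
predict_1  <- c(1, 2, 3, 4, 5, 6, 7, 8)
predict_2  <- c(1, 2, 3, 6, 5, 4, 7, 8)
predict_3  <- c(1, 7, 6, 4, 5, 3, 2, 8)

# auc_by_pairs is a hypothetical helper: fraction of (positive, negative) pairs
# that are ranked correctly, counting ties as half
auc_by_pairs <- function(score, label) {
  pos <- score[label == 1]
  neg <- score[label == 0]
  diffs <- outer(pos, neg, "-")  # every positive score minus every negative score
  (sum(diffs > 0) + 0.5 * sum(diffs == 0)) / length(diffs)
}

auc_1 <- auc_by_pairs(predict_1, true_label)
auc_2 <- auc_by_pairs(predict_2, true_label)
auc_3 <- auc_by_pairs(predict_3, true_label)
c(auc_1, auc_2, auc_3)
```

Because only the sign of each difference matters, any order-preserving transform of the scores (for example exp(predict_2)) gives the same AUC, which is the "order, not magnitude" property of the metric.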
Look into the Data
load data from files
Look into the Data
the target
table(y, useNA = "ifany")
## y
## 0 1 <NA>
## 1897 30872 58921
Look into the Data
the predictor
Look into the Data
treat the features as Categorical or Numerical?
sapply(x, function(z) {
length(unique(z))
})
## resource mgr_id role_rollup_1 role_rollup_2
## 7518 4913 130 183
## role_deptname role_title role_family_desc role_family
## 476 361 2951 68
## role_code
## 361
Look into the Data
par(mar = c(5, 4, 0, 2))
plot(x$role_title, x$role_code)
Look into the Data
length(unique(x$role_title))
## [1] 361
length(unique(x$role_code))
## [1] 361
length(unique(paste(x$role_code, x$role_title)))
## [1] 361
Look into the Data
x <- x[, names(x) != "role_code"]
sapply(x, function(z) {
length(unique(z))
})
## resource mgr_id role_rollup_1 role_rollup_2
## 7518 4913 130 183
## role_deptname role_title role_family_desc role_family
## 476 361 2951 68
Look into the Data
check the distribution - role_family_desc
hist(train$role_family_desc, breaks = 100)
hist(test$role_family_desc, breaks = 100)
Look into the Data
check the distribution - resource
hist(train$resource, breaks = 100)
hist(test$resource, breaks = 100)
Look into the Data
check the distribution - mgr_id
hist(train$mgr_id, breaks = 100)
hist(test$mgr_id, breaks = 100)
Look into the Data
treat the features as Categorical or Numerical?
YetiMan shared his findings in the forum:
· 1) My analyses so far leads me to believe that there is "information" in some of the categorical labels themselves. My hunch is that they imply some sort of chronology, but I can't be certain.
· 2) Just for fun I increased the max classes for R's gbm package to 8192 and built a model (using plain vanilla training data). The leader board result was 0.87 - slightly worse than the all-numeric gbm. Food for thought.
Look into the Data
our approach
1. treat all features as Categorical
2. treat all features as Numerical
3. treat mgr_id as Numerical, the others as Categorical
Model Building
workflow
· Feature Extraction
· Base Learners
· Ensemble
Model Building
Feature Extraction
1. the raw features (as numerical)
2. the raw features (as categorical) with level reduction
3. the dummies (in sparse Matrix)
4. the dummies including the interaction
5. some derived variables (count & ratio)
Model Building
1. the raw features (as numerical)
Model Building
2. the raw features (as categorical) with level reduction
2.1 choose the top frequency categories
VAR_RAW FREQUENCY VAR_WITH_LEVEL_REDUCTION
a 3 a
a 3 a
a 3 a
b 2 b
b 2 b
c 1 other
d 1 other
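# keep the two most frequent levels of each column; pool the rest into "other"
for (i in seq_len(ncol(x))) {
  x[, i] <- as.character(x[, i])  # avoids NA values when the column is a factor
  the_labels <- names(sort(table(x[, i]), decreasing = TRUE)[1:2])
  x[!x[, i] %in% the_labels, i] <- "other"
}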
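As a check on the loop above, here is a minimal sketch that applies the same top-2 rule to the seven-row toy column from the table and rebuilds all three of its columns. The variable names are illustrative only.

```r
var_raw <- c("a", "a", "a", "b", "b", "c", "d")

freq <- table(var_raw)                    # how often each level occurs
frequency <- as.integer(freq[var_raw])    # per-row frequency of the row's own level
top_levels <- names(sort(freq, decreasing = TRUE))[1:2]
var_reduced <- ifelse(var_raw %in% top_levels, var_raw, "other")

data.frame(VAR_RAW = var_raw,
           FREQUENCY = frequency,
           VAR_WITH_LEVEL_REDUCTION = var_reduced)
```

Levels c and d each occur once, so both fall outside the top two and are pooled into "other".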
Model Building
2. the raw features (as categorical) with level reduction
2.2 use Pearson's Chi-squared Test
table(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))
##
## mgr_770 mgr_not_770
## 0 5 1892
## 1 147 30725
chisq.test(y$y, ifelse(x$mgr_id == 770, "mgr_770", "mgr_not_770"))$p.value
## [1] 0.2507
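The test above can be rerun from the printed counts alone. The sketch below rebuilds the 2 x 2 table and then applies a decision rule; the rule and its 0.05 cutoff are assumptions for illustration, since the slides do not state how the p-value was turned into a level-reduction decision.

```r
# counts copied from the slide: rows are ACTION (0/1), columns are the manager grouping
tb <- matrix(c(5, 147, 1892, 30725), nrow = 2,
             dimnames = list(action = c("0", "1"),
                             mgr = c("mgr_770", "mgr_not_770")))

# for 2 x 2 tables chisq.test applies Yates' continuity correction by default
p_value <- chisq.test(tb)$p.value

# hypothetical rule (the cutoff is an assumption, not from the talk):
# keep the level as its own category only if it is associated with the target
cutoff <- 0.05
keep_level <- p_value < cutoff
print(p_value)
print(keep_level)
```

Under this assumed rule mgr_id 770 would be pooled with other weakly associated levels, because its approval rate is not distinguishable from the rest of the data.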
Model Building
3. the dummies (in sparse Matrix)
ID VAR VAR_A VAR_B VAR_C
1 a 1 0 0
2 a 1 0 0
3 a 1 0 0
4 b 0 1 0
5 c 0 0 1
Model Building
3. the dummies (in sparse Matrix)
use package Matrix to create the dummies
require(Matrix)
set.seed(114)
Matrix(sample(c(0, 1), 40, replace = TRUE, prob = c(0.6, 0.1)), nrow = 5)
## 5 x 8 sparse Matrix of class "dgCMatrix"
##
## [1,] . . . 1 . . . 1
## [2,] . 1 . . . . 1 .
## [3,] 1 . . . . . . .
## [4,] . . . . . 1 . .
## [5,] . . . . . . . .
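The random matrix above only shows the storage format. A minimal sketch of building the actual dummies for the five-row toy column VAR, using sparseMatrix from the Matrix package; the construction is one possible approach and not necessarily the one used in the competition code.

```r
library(Matrix)

var <- c("a", "a", "a", "b", "c")
f <- factor(var)

# one row per observation, one column per level, a single 1 per row
dummies <- sparseMatrix(i = seq_along(f),
                        j = as.integer(f),
                        x = 1,
                        dims = c(length(f), nlevels(f)),
                        dimnames = list(NULL, paste0("VAR_", toupper(levels(f)))))
dummies
```

Only the five non-zero entries are stored, which is what makes this encoding workable for columns such as resource with thousands of levels.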
Model Building
4. the dummies including the interaction
ID M N MN_AP MN_AQ MN_BP MN_BQ
1 a p 1 0 0 0
2 a p 1 0 0 0
3 a q 0 1 0 0
4 b p 0 0 1 0
5 b q 0 0 0 1
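The interaction table above can be reproduced by pasting the two labels into one combined label and one-hot encoding it. A minimal base R sketch with illustrative variable names:

```r
m <- c("a", "a", "a", "b", "b")
n <- c("p", "p", "q", "p", "q")

# combine the two labels into one, then create one indicator column per combination
mn <- factor(toupper(paste0(m, n)))
dummies <- model.matrix(~ mn - 1)
colnames(dummies) <- paste0("MN_", levels(mn))
dummies
```

With high-cardinality columns the dense model.matrix would be too large; sparse.model.matrix from the Matrix package accepts the same formula and returns a sparse result.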
Model Building
5. some derived variables (count & ratio)
· the frequency of every category
· the frequency of the interactions
· the proportion (interaction frequency divided by the frequency of each of its categories)
Model Building
5. some derived variables (count & ratio)
tmp1 <- cnt_1[114:117, c('c1_resource', 'c1_role_deptname')]
tmp2 <- cnt_2[114:117, c('c2_resource_role_deptname_cnt_ij',
'c2_resource_role_deptname_ratio_i',
'c2_resource_role_deptname_ratio_j')]
cbind(tmp1, tmp2)
## c1_resource c1_role_deptname c2_resource_role_deptname_cnt_ij
## 114 1 1645 1
## 115 36 1312 4
## 116 45 465 24
## 117 374 2377 169
## c2_resource_role_deptname_ratio_i c2_resource_role_deptname_ratio_j
## 114 1.0000 0.0006079
## 115 0.1111 0.0030488
## 116 0.5333 0.0516129
## 117 0.4519 0.0710980
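The printed rows suggest how the ratios are defined: ratio_i is cnt_ij divided by the resource count (24 / 45 = 0.5333) and ratio_j is cnt_ij divided by the department count (24 / 465 = 0.0516). A minimal sketch of those definitions on a toy data frame; the toy data and the short column names are assumptions, not the competition code.

```r
d <- data.frame(resource      = c("r1", "r1", "r1", "r2", "r2", "r3"),
                role_deptname = c("d1", "d1", "d2", "d1", "d2", "d2"))

# ave() returns, for every row, the size of the group that row belongs to
ones <- rep(1, nrow(d))
c1_resource      <- ave(ones, d$resource, FUN = sum)
c1_role_deptname <- ave(ones, d$role_deptname, FUN = sum)
cnt_ij           <- ave(ones, d$resource, d$role_deptname, FUN = sum)

features <- data.frame(c1_resource, c1_role_deptname, cnt_ij,
                       ratio_i = cnt_ij / c1_resource,       # share of the resource's rows
                       ratio_j = cnt_ij / c1_role_deptname)  # share of the department's rows
features
```

A ratio_i of 1, as in row 114 of the slide, means the resource was requested from one department only.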
Model Building
Base Learners
1. Regularized Generalized Linear Model
2. Support Vector Machine
3. Random Forest
4. Gradient Boosting Machine
Model Building
Ensemble
1. mean prediction of all models
2. two-stage stacking
  · based on 5-fold cv holdout predictions
  · algorithms in level-1 (Regularized Generalized Linear Model & Gradient Boosting Machine)
  · algorithms in level-2 (Regularized Generalized Linear Model)
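The two ensembling options above can be sketched in base R. This is a simplified illustration on simulated data, with two plain glm fits standing in for the level-1 learners (an assumption: the talk used glmnet and gbm) and a plain glm as the level-2 model.

```r
set.seed(114)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(1.5 * x1 - x2))
d  <- data.frame(y, x1, x2)

# two stand-in level-1 learners (assumption: not the models used in the talk)
level1 <- list(m_a = y ~ x1, m_b = y ~ x2)

k <- 5
fold <- sample(rep(1:k, length.out = n))
holdout <- matrix(NA_real_, n, length(level1),
                  dimnames = list(NULL, names(level1)))

for (f in 1:k) {
  train <- d[fold != f, ]
  valid <- d[fold == f, ]
  for (m in names(level1)) {
    fit <- glm(level1[[m]], data = train, family = binomial)
    # each row is predicted by a model that never saw it
    holdout[fold == f, m] <- predict(fit, newdata = valid, type = "response")
  }
}

# 1. mean prediction of all models
mean_pred <- rowMeans(holdout)

# 2. two-stage stacking: the level-2 model is fitted on the holdout predictions
level2 <- glm(y ~ m_a + m_b, data = data.frame(y = d$y, holdout), family = binomial)
stack_pred <- fitted(level2)
```

Fitting level 2 on holdout predictions rather than in-sample predictions keeps it from rewarding whichever level-1 model overfits the most.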
Model Building
1. Regularized Generalized Linear Model
· generalized linear model (glm)
· convex penalties
Model Building
1. Regularized Generalized Linear Model
· logistic regression
set.seed(114)  # set the seed before any random draw so that x is reproducible too
x <- sort(rnorm(100))
y <- c(sample(x=c(0,1),size=30,prob=c(0.9,0.1),replace=TRUE),
sample(x=c(0,1),size=20,prob=c(0.7,0.3),replace=TRUE),
sample(x=c(0,1),size=20,prob=c(0.3,0.7),replace=TRUE),
sample(x=c(0,1),size=30,prob=c(0.1,0.9),replace=TRUE))
m1 <- lm(y~x)                               # linear fit, for comparison
m2 <- glm(y~x,family=binomial(link=logit))  # logistic regression
y2 <- predict(m2,type='response')           # fitted probabilities for the training x
par(mar=c(5,4,0,0))
plot(y~x);abline(m1,lwd=3,col=2)
points(x,y2,type='l',lwd=3,col=3)
Model Building
1. Regularized Generalized Linear Model
· logistic regression
· convex penalties
Model Building
1. Regularized Generalized Linear Model
· convex penalties
  - L1 (lasso)
  - L2 (ridge regression)
  - mixture of L1 & L2 (elastic net)
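For reference, the three penalties can be written as one objective. The form below is the penalized logistic regression objective as documented for glmnet; the symbols (N observations, intercept \beta_0, coefficients \beta, penalty weight \lambda, mixing parameter \alpha) are notation introduced here, not taken from the slides.

```latex
\min_{\beta_0,\,\beta}\;
  -\frac{1}{N}\sum_{i=1}^{N}
    \Big[\, y_i\,(\beta_0 + x_i^{\top}\beta)
          - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
  \;+\; \lambda \Big[\, \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2
                      + \alpha\,\lVert \beta \rVert_1 \Big]
```

Setting \alpha = 1 gives the lasso, \alpha = 0 gives ridge regression, and values in between give the elastic net.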
Model Building
1. Regularized Generalized Linear Model
· the dummies (in sparse Matrix)
· the dummies including the interaction
· R package: glmnet
Model Building
2. Support Vector Machine (just for Diversity)
· the dummies including the interaction
· some derived variables (count & ratio)
· R package: kernlab, e1071
Model Building
decision tree
Model Building
3. Random Forest
decision trees + bagging
Model Building
3. Random Forest
· the raw features (as numerical)
· the raw features (as categorical) with level reduction
· some derived variables (count & ratio)
· R package: randomForest
Model Building
4. Gradient Boosting Machine
decision trees + boosting
Model Building
4. Gradient Boosting Machine
· the raw features (as numerical)
· the raw features (as categorical) with level reduction
· some derived variables (count & ratio)
· R package: gbm
Summary
some insights
VARIABLE NAME REL.INF
cnt2_resource_role_deptname_cnt_ij 2.542974017
cnt2_resource_role_rollup_2_ratio_i 2.107624216
cnt2_resource_role_deptname_ratio_j 2.017153645
cnt2_resource_role_rollup_2_ratio_j 1.910465811
cnt2_resource_role_family_ratio_i 1.770737494
... ...
cnt4_resource_mgr_id_role_rollup_2_role_family_desc 0.008938286
cnt4_resource_role_rollup_1_role_rollup_2_role_title 0.008930661
cnt4_resource_mgr_id_role_rollup_1_role_family_desc 0.002106958
Summary
some insights
summary(x[, c('cnt2_resource_role_deptname_cnt_ij',
'cnt2_resource_role_deptname_ratio_j')])
## cnt2_resource_role_deptname_cnt_ij cnt2_resource_role_deptname_ratio_j
## Min. : 1.0 Min. :0.0003
## 1st Qu.: 2.0 1st Qu.:0.0061
## Median : 7.0 Median :0.0172
## Mean : 15.6 Mean :0.0315
## 3rd Qu.: 17.0 3rd Qu.:0.0368
## Max. :201.0 Max. :1.0000
Summary
some insights
xx <- x[, 'cnt2_resource_role_deptname_cnt_ij']
tt <- t.test(xx ~ y)
list(estimate=tt$estimate,
conf.int=tt$conf.int, p.value=tt$p.value)
## $estimate
## mean in group 0 mean in group 1
## 10.04 13.82
##
## $conf.int
## [1] -4.851 -2.710
## attr(,"conf.level")
## [1] 0.95
##
## $p.value
## [1] 5.838e-12
par(mar=c(5,4,2,2))
boxplot(xx ~ y)
Summary
some insights
xxx <- cut(xx, include.lowest=T,
breaks=c(0,1,3,7,14,30,300))
par(mar=c(5,2,0,0))
barplot(table(xxx))
tb <- table(y, xxx)
r_0 <- tb[1, ] / colSums(tb)
par(mar=c(5,2,0,0))
plot(r_0, type='l', lwd=3)
Summary
some insights
xx <- x[, 'cnt2_resource_role_deptname_ratio_j']
tt <- t.test(xx ~ y)
list(estimate=tt$estimate,
conf.int=tt$conf.int, p.value=tt$p.value)
## $estimate
## mean in group 0 mean in group 1
## 0.01955 0.02902
##
## $conf.int
## [1] -0.011732 -0.007205
## attr(,"conf.level")
## [1] 0.95
##
## $p.value
## [1] 3.93e-16
par(mar=c(5,4,2,2))
boxplot(xx ~ y)
Summary
some insights
xxx <- cut(xx, include.lowest=T,
breaks=quantile(xx, seq(0,1,0.2)))
par(mar=c(5,2,0,0))
barplot(table(xxx))
tb <- table(y, xxx)
r_0 <- tb[1, ] / colSums(tb)
par(mar=c(5,2,0,0))
plot(r_0, type='l', lwd=3)
Summary
overfitting
MODEL AUC_CV AUC_PUBLIC AUC_PRIVATE
num_glmnet_0 0.8985069 0.87737 0.87385
stacking_gbm_with_the_glmnet 0.9277316 0.90695 0.90478
stacking_gbm_without_the_glmnet 0.9182303 0.91529 0.91130
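The overfitting shows up in the gap between the cross-validated AUC and the private leaderboard AUC. A small sketch of that arithmetic on the three rows above; the column name cv_minus_private is introduced here for illustration.

```r
res <- data.frame(
  model = c("num_glmnet_0",
            "stacking_gbm_with_the_glmnet",
            "stacking_gbm_without_the_glmnet"),
  auc_cv      = c(0.8985069, 0.9277316, 0.9182303),
  auc_public  = c(0.87737, 0.90695, 0.91529),
  auc_private = c(0.87385, 0.90478, 0.91130)
)

# how much of the cross-validated score did not carry over to the private leaderboard
res$cv_minus_private <- res$auc_cv - res$auc_private

res[order(-res$auc_private), c("model", "auc_cv", "auc_private", "cv_minus_private")]
```

The stack that includes the glmnet has the best CV score but loses about 0.023 AUC on the private leaderboard, while the stack without it loses about 0.007 and ranks first there.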
Summary
overfitting
Winning solution code and methodology
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/5283/winning-solution-code-and-methodology
Summary
useful discussions
Python code to achieve 0.90 AUC with Logistic Regression
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4838/python-code-to-achieve-0-90-auc-with-logistic-regression
Starter code in python with scikit-learn (AUC .885)
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4797/starter-code-in-python-with-scikit-learn-auc-885
Patterns in Training data set
http://www.kaggle.com/c/amazon-employee-access-challenge/forums/t/4886/patterns-in-training-data-set
thank you