Introduction to Data Science, Chapter 13

Doing Data Science ch.13: Data Leakage and Model Evaluation

cecil

What is this chapter about?

Drawing on experience gained from taking part in data competitions, it explains data leakage and how to evaluate models

Understanding both the data and the domain is important

Lessons learned from data mining competitions

•Leakage

•Real-life performance measures

•Feature construction/transformation

How to become a good model developer

Data Leakage

Performance that is “too good to be true” is

clear evidence that an anachronism exists

- Dorian Pyle

Leakage? Data that helps you predict something,

but that is probably not legitimate to use (?)

Amazon case study: finding big spenders

•Goal: use past purchase data to identify customers who are likely to spend a lot (transaction data across several categories)

•What the winning entry found: “free shipping = True” was an excellent predictor

•Problems

•Free shipping is an effect of heavy spending!

•Free-shipping data is generated at the same time as the sale (so it does not work for new customers)

Jewelry sampling

•Goal: predict which customers will buy jewelry

•Successful model: one that checks whether sum(revenue) = 0

•Problems

•The people who prepared the contest data deleted the jewelry purchases and built the dataset from people who had bought something

•In other words, the model was not trained on the right data.

IBM customer targeting

•Goal: predict which companies would be willing to buy IBM's “WebSphere” solution

•Data: transaction data and the content of prospective customers' websites

•Winning model: whether the term “WebSphere” appeared on a company's website was a strong predictor

•Problems

•A customer that has not yet bought WebSphere would not have WebSphere on its site

•The predictor would be meaningful only if the website content dated from before WebSphere was there

Discussion: breast cancer detection

Figure 13-1. Patients ordered by patient identifier; red means cancerous, green means not

This situation led to an interesting discussion in the classroom:

Student: For the purposes of the contest, they should have renumbered the patients and randomized.

Claudia: Would that solve the problem? There could be other things in common as well.

Another student: The important issue could be to see the extent to which we can figure out which dataset a given patient came from based on things besides their ID.

Claudia: Think about this: what do we want these models for in the first place? How well can you really predict cancer?

Given a new patient, what would you do? If the new patient is in a fifth bin in terms of patient ID, then you wouldn’t want to use the identifier model. But if it’s still in this scheme, then maybe that really is the best approach.

This discussion brings us back to the fundamental problem: we need to know what the purpose of the model is and how it is going to be used in order to decide how to build it, and whether it’s working.

Pneumonia Prediction

During an INFORMS competition on pneumonia predictions in hospital records, where the goal was to predict whether a patient has pneumonia, a logistic regression that included the number of diagnosis codes as a numeric feature (AUC of 0.80) didn’t do as well as the one that included it as a categorical feature (0.90). What happened?


An identifier that looks irrelevant actually carries predictive power

What is the purpose of the model?

How do we build a model that achieves that purpose? Is it even possible?

How to avoid leakage

1. Apply a strict time cutoff

•Remove all information from just before the event of interest

•Every record needs a timestamp for when the information became known, not for when the event itself occurred

2. Start from scratch with the raw data

3. Know how the data was created
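
The time-cutoff rule can be sketched as below. This is only a minimal illustration: the event records, the field name `known_at`, and the helper `features_before` are hypothetical, and the cutoff is assumed to sit at the moment of the event being predicted.

```python
from datetime import datetime

# Hypothetical event log. Each record stores when the fact became KNOWN
# ("known_at"), which is the timestamp the cutoff must be applied to.
events = [
    {"customer": "a", "feature": "viewed_item", "known_at": datetime(2014, 1, 5)},
    {"customer": "a", "feature": "free_shipping", "known_at": datetime(2014, 1, 10)},
    {"customer": "b", "feature": "viewed_item", "known_at": datetime(2014, 1, 7)},
]


def features_before(records, cutoff):
    """Keep only records that were known strictly before the cutoff."""
    return [r for r in records if r["known_at"] < cutoff]


# Assumed time of the event we want to predict (for example, the purchase).
cutoff = datetime(2014, 1, 10)
usable = features_before(events, cutoff)
print([r["feature"] for r in usable])
```

Here the `free_shipping` record is dropped because it only became known at the same moment as the sale, which mirrors the Amazon example.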

Evaluating models

Figure 13-3. This classic image from Hastie and Tibshirani’s Elements of Statistical Learning (Springer-Verlag) shows fitting linear regression to a binary response, fitting 15-nearest neighbors, and fitting 1-nearest neighbors all on the same dataset

The picture on the left is underfit, in the middle it’s good, and on the right it’s overfit.

The model you use matters when it concerns overfitting, as shown in Figure 13-4.

Figure 13-4. The model you use matters!

Looking at Figure 13-4, unpruned decision trees are the overfitting-est (we just made that word up). This is a well-known problem with unpruned decision trees, which is why people use pruned decision trees.


Even powerful pattern-finding algorithms carry a serious risk of overfitting

Fits to a binary response, all on the same dataset:

linear regression, 15-nearest neighbors, 1-nearest neighbors

Evaluating models: accuracy

Accuracy, discussed in the previous chapter, must be used with care

On data where most outcomes are 1,

an absurd model that always predicts 1 can show high accuracy, while a better model may have lower accuracy
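
A small worked example of this pitfall, using made-up labels (nine 1s and a single 0) and two hypothetical sets of predictions. It is a sketch of the arithmetic only, not a claim about any real model.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up outcomes: 9 of the 10 cases are 1.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

# The "always 1" model never finds the rare 0.
always_one = [1] * len(y_true)

# A hypothetical model that does find the 0 but misses two of the 1s.
finds_zero = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

print(accuracy(y_true, always_one))  # 0.9
print(accuracy(y_true, finds_zero))  # 0.8
```

The second model is the only one that ever detects the rare class, yet it scores lower on accuracy.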

Evaluating models: probabilities matter, not just 0s and 1s

People like probabilities.

Figure 13-5. An example of how to draw an ROC curve

Sometimes to measure rankings, people draw the so-called lift curve shown in Figure 13-6.

Figure 13-6. The so-called lift curve

The key here is that the lift is calculated with respect to a baseline. You draw it at a given point, say 10%, by imagining that 10% of people are


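ROC Curve?

A plot of sensitivity (TPR) against the false positive rate (FPR = 1 - specificity)

Judge the model by computing the AUC (the area under the ROC curve)

For a useful model the value falls between 0.5 (random ranking) and 1, and the closer to 1 the better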
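
One way to compute the AUC without drawing the curve is to use its ranking interpretation: the fraction of (positive, negative) pairs in which the positive case gets the higher score, with ties counted as half. The sketch below assumes binary 0/1 labels and uses made-up scores; the function name `auc` is just illustrative.

```python
def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities.
y_true = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]
print(auc(y_true, scores))  # 5 of the 6 pairs are ordered correctly
```

The pairwise loop is quadratic, so it only suits small examples; it is shown because it makes the meaning of the number explicit.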

Choosing an algorithm

Choose according to the size and properties of the dataset

Choose the evaluation method based on the data

Choose the algorithm that fits the most important objective

Conclusion

So what matters is...

•Creating the data carefully

•Reflecting carefully on the problem

•Modeling the data in a way that can actually be used

•Being sure that you are optimizing what you really want

•Knowing which algorithms suit which tasks

References

•Rachel Schutt, Cathy O’Neil, 데이터 과학 입문 (Doing Data Science), translated by 윤영민, 허선, 전희주, 김정일, 류자현. Mapo-gu, Seoul: Hanbit Media, 2014
