Naïve Bayes: Classifying with Probability Theory


Page 1: Naive Bayes by Seo

Naïve Bayes: Classifying with Probability Theory

Page 2: Naive Bayes by Seo

A classification method based on probability theory. Naïve Bayes is a subset of Bayesian decision theory.

Pros: works with a small amount of data and can handle multiple classes.

Cons: sensitive to how the input data is prepared.

Works with: nominal values.

Naïve Bayes

Page 3: Naive Bayes by Seo

Bayes' theorem describes the relationship between the prior and posterior probabilities of two random variables. Under the Bayesian interpretation of probability, it specifies how the posterior probability is updated when new evidence is presented.

Conditional probability / Bayes' rule

Bayes’ theorem

Page 4: Naive Bayes by Seo

x, y = features
p1(x, y) = probability that (x, y) belongs to class 1
p2(x, y) = probability that (x, y) belongs to class 2

if p1(x, y) > p2(x, y), then class 1
if p1(x, y) < p2(x, y), then class 2

→ Choose the class with the higher probability.

Bayes’ theorem
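In Python this decision rule is just a comparison; a minimal sketch (p1 and p2 stand for whatever probability estimates are available; the names are illustrative):

    def classify(x, y, p1, p2):
        # p1(x, y), p2(x, y): estimated probabilities for class 1 and class 2
        return 1 if p1(x, y) > p2(x, y) else 2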

Page 5: Naive Bayes by Seo

The conditional probability of A given B is the probability that event A occurs given that event B has occurred. The probability of A changes under the influence of B, and this changed probability is called the conditional probability.

Notation: P(A|B)
Definition: P(A|B) = P(A ∩ B) / P(B)

Conditional Probability

Page 6: Naive Bayes by Seo

Suppose 3 white stones and 4 black stones are distributed between buckets A and B as in the figure. If we draw a stone from bucket B, what is the probability that it is white?

Example 1

[Figure: Bucket A holds 2 white and 2 black stones; Bucket B holds 1 white and 2 black stones.]

Page 7: Naive Bayes by Seo

The probability we need to find:

Example 1

P(white | bucketB)

Page 8: Naive Bayes by Seo

Example 1

P(white | bucketB) = P(white ∩ bucketB) / P(bucketB)

P(white ∩ bucketB) = 1/7
P(bucketB) = 3/7

P(white | bucketB) = (1/7) / (3/7) = 1/3

∴ 33.3%
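As a quick check, the same arithmetic in Python (a sketch; the variable names are illustrative):

    from fractions import Fraction

    p_white_and_B = Fraction(1, 7)   # P(white ∩ bucketB): 1 of the 7 stones is white and in B
    p_B = Fraction(3, 7)             # P(bucketB): 3 of the 7 stones are in bucket B
    print(p_white_and_B / p_B)       # 1/3, i.e. 33.3%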

Page 9: Naive Bayes by Seo

Bayes' Rule

P(A|B) = P(B|A) P(A) / P(B)

P(A|B) P(B) = P(A ∩ B) = P(B|A) P(A)

P(x) = Σ_{i=1}^{n} P(x | c_i) P(c_i)    ※ c = vector of class labels

Page 10: Naive Bayes by Seo

If P(A), P(B), and P(B|A) are known, P(A|B) can be computed from them.

P(A) = prior probability of A
P(B) = prior probability of B
P(B|A) = conditional probability of B given A
P(A|B) = posterior probability

Bayes’ theorem

Page 11: Naive Bayes by Seo

Problem: We are testing for a rare disease that only about 1% of the population has. The test is highly sensitive and specific, but not perfect:

- 99% of sick people test positive ("sick").
- 99% of healthy people test negative.

If someone tests positive, what is the probability that they really have the disease?

Example 2

Page 12: Naive Bayes by Seo

Example 2

P(sick | +) = P(+ | sick) P(sick) / P(+)

= (0.99 × 0.01) / (0.99 × 0.01 + 0.01 × 0.99)

= 0.50

∴ 50%
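The same computation as a short Python sketch (names are illustrative):

    p_sick = 0.01                    # prior: 1% of the population is sick
    p_pos_given_sick = 0.99          # sensitivity
    p_pos_given_healthy = 0.01       # false-positive rate

    p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)
    print(p_pos_given_sick * p_sick / p_pos)   # 0.5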

Page 13: Naive Bayes by Seo

All events (features) are assumed to be mutually independent.

Naive

P(a, b, c, …, n) = P(a) P(b) P(c) … P(n)

Page 14: Naive Bayes by Seo

If P(A1|B) > P(A2|B), the item belongs to class A1.

If P(A1|B) < P(A2|B), the item belongs to class A2.

Naïve Bayes Classifier

P(A|B) = P(B|A) P(A) / P(B)

Page 15: Naive Bayes by Seo

To compare which posterior is larger, the denominator P(B) is not needed.

Naïve Bayes Classifier

P(A1|B) = P(B|A1) P(A1) / P(B)

P(A2|B) = P(B|A2) P(A2) / P(B)

The denominators P(B) are identical, so comparing the numerators is enough.
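A minimal Python sketch of this shortcut, assuming the likelihoods and priors are already known (all names are illustrative):

    def pick_class(likelihood1, prior1, likelihood2, prior2):
        score1 = likelihood1 * prior1   # proportional to P(A1|B); P(B) omitted
        score2 = likelihood2 * prior2   # proportional to P(A2|B); same P(B) would cancel
        return "A1" if score1 > score2 else "A2"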

Page 16: Naive Bayes by Seo

Naïve Bayes is a popular algorithm for document classification.

The words appearing in each document can be used as features: checking whether each word is present in a document lets us classify it.

Ex) spam mail filtering, bulletin-board category classification, etc.

Practice

P(spam | email)

Page 17: Naive Bayes by Seo

Classify whether a post on a bulletin board is abusive or not abusive.

Practice – abusive or not abusive

P(abusive | doc)
P(notAbusive | doc)

Page 18: Naive Bayes by Seo

Practice – abusive or not abusive

P(abusive | doc) = P(doc | abusive) P(abusive) / P(doc)

doc = set of sentences = set of words = vector of words = (w1 w2 w3 … wn)

P(abusive | doc) ∝ P(doc | abusive) P(abusive)

The denominator P(doc) does not affect which class scores higher.

Page 19: Naive Bayes by Seo

Practice – abusive or not abusive

P(doc | abusive) = P(w | abusive),   w = vector of words (w1 w2 … wn)

P(w1, w2, …, wn | abusive)
= P(w1 | abusive) P(w2 | abusive) … P(wn | abusive)    (by the naïve assumption)
= ∏_{i=1}^{n} P(wi | abusive)

Page 20: Naive Bayes by Seo

Practice – abusive or not abusive

P(abusive | doc) ∝ P(doc | abusive) P(abusive)
= P(w | abusive) P(abusive)
= ∏_{i=1}^{n} P(wi | abusive) · P(abusive)
= P(w1 | abusive) P(w2 | abusive) … P(wn | abusive) P(abusive)

Page 21: Naive Bayes by Seo

Pseudocode

Practice – abusive or not abusive

Count the number of documents in each class
for every training document:
    for each class:
        if a token appears in the document:
            increment the count for that token
        increment the total token count
for each class:
    for each token:
        divide the token count by the total token count to get conditional probabilities
return conditional probabilities for each class

Page 22: Naive Bayes by Seo

Python Code

Practice – abusive or not abusive

from numpy import zeros

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)          # number of training documents
    numWords = len(trainMatrix[0])           # vocabulary size
    pAbusive = sum(trainCategory) / float(numTrainDocs)   # P(abusive)
    p0Num = zeros(numWords); p1Num = zeros(numWords)      # per-word counts
    p0Denom = 0.0; p1Denom = 0.0                          # total word counts
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:            # abusive document
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:                                # not-abusive document
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num / p1Denom                 # P(w | abusive)
    p0Vect = p0Num / p0Denom                 # P(w | notAbusive)
    return p0Vect, p1Vect, pAbusive
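A usage sketch, assuming the trainingMatrix and trainCategory defined on the following slides:

    p0Vect, p1Vect, pAbusive = trainNB0(trainingMatrix, trainCategory)
    print(pAbusive)   # 0.4, i.e. P(abusive) = 2/5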

Page 23: Naive Bayes by Seo

Practice – abusive or not abusive

Training Data

please dog flea problems            → not abusive
maybe park dog stupid               → not abusive
stop kill stupid garbage            → abusive
link steak stop dog                 → not abusive
stupid worthless homeless garbage   → abusive

Page 24: Naive Bayes by Seo

Practice – abusive or not abusive

Training Data

listOfPosts = { {"please", "dog", "flea", "problem"},
                {"maybe", "park", "dog", "stupid"},
                {"stop", "kill", "stupid", "garbage"},
                {"link", "steak", "stop", "dog"},
                {"stupid", "worthless", "homeless", "garbage"} }

ClassVec = {0, 0, 1, 0, 1}    0 → not abusive, 1 → abusive

vocaList = {"please", "dog", "flea", "problem", "maybe", "park", "stupid", "stop", "kill", "garbage", "link", "steak", "worthless", "homeless"}
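The deck does not show how listOfPosts is turned into the 0/1 matrix used below; a minimal helper in the style of this example might look like the following, assuming vocaList is an ordered list (the name setOfWords2Vec is illustrative):

    def setOfWords2Vec(vocabList, inputSet):
        # build a 0/1 vector: 1 if the vocabulary word appears in the post
        returnVec = [0] * len(vocabList)
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] = 1
        return returnVec

    # setOfWords2Vec(vocaList, ["maybe", "park", "dog", "stupid"])
    # -> [0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]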

Page 25: Naive Bayes by Seo

Practice – abusive or not abusive

Training Data

please dog flea problems            → not abusive
maybe park dog stupid               → not abusive
stop kill stupid garbage            → abusive
link steak stop dog                 → not abusive
stupid worthless homeless garbage   → abusive

P(abusive) = 2/5

P(notAbusive) = 1 − P(abusive) = 3/5

Page 26: Naive Bayes by Seo

Practice – abusive or not abusive

vocaList = {"please", "dog", "flea", "problem", "maybe", "park", "stupid", "stop", "kill", "garbage", "link", "steak", "worthless", "homeless"}

trainingMatrix = { {1,1,1,1,0,0,0,0,0,0,0,0,0,0},
                   {0,1,0,0,1,1,1,0,0,0,0,0,0,0},
                   {0,0,0,0,0,0,1,1,1,1,0,0,0,0},
                   {0,1,0,0,0,0,0,1,0,0,1,1,0,0},
                   {0,0,0,0,0,0,1,0,0,1,0,0,1,1} }

trainCategory = {0, 0, 1, 0, 1}

def trainNB0(trainMatrix, trainCategory):

Page 27: Naive Bayes by Seo

Practice – abusive or not abusive

def trainNB0(trainMatrix, trainCategory):

trainingMatrix (1 = word appears in the post, 0 = it does not):

        please dog flea problem maybe park stupid stop kill garbage link steak worthless homeless
post 1:    1    1    1     1      0    0     0     0    0     0      0    0       0        0
post 2:    0    1    0     0      1    1     1     0    0     0      0    0       0        0
post 3:    0    0    0     0      0    0     1     1    1     1      0    0       0        0
post 4:    0    1    0     0      0    0     0     1    0     0      1    1       0        0
post 5:    0    0    0     0      0    0     1     0    0     1      0    0       1        1

Page 28: Naive Bayes by Seo

Practice – abusive or not abusive

numTrainDocs = len(trainMatrix)
numWords = len(trainMatrix[0])
pAbusive = sum(trainCategory)/float(numTrainDocs)

numTrainDocs = total number of training posts (rows) = 5
numWords = number of features (columns, words) = 14
trainCategory = {0, 0, 1, 0, 1}
sum(trainCategory) = 0+0+1+0+1 = 2 = number of abusive posts
pAbusive = P(abusive) = 2/5

Page 29: Naive Bayes by Seo

Practice – abusive or not abusive

p0Num = zeros(numWords); p1Num = zeros(numWords)
p0Denom = 0.0; p1Denom = 0.0

p0Num = [0 0 0 0 0 0 0 0 0 0 0 0 0 0]
p1Num = [0 0 0 0 0 0 0 0 0 0 0 0 0 0]

p0Denom = 0.0
p1Denom = 0.0

Page 30: Naive Bayes by Seo

Practice – abusive or not abusive

for i in range(numTrainDocs):
    if trainCategory[i] == 1:
        p1Num += trainMatrix[i]
        p1Denom += sum(trainMatrix[i])
    else:
        p0Num += trainMatrix[i]
        p0Denom += sum(trainMatrix[i])

p0Num = [1 3 1 1 1 1 1 1 0 0 1 1 0 0]  ← count of each word in not-abusive posts
p0Denom = 12  ← total number of words in not-abusive posts

p1Num = [0 0 0 0 0 0 2 1 1 2 0 0 1 1]  ← count of each word in abusive posts
p1Denom = 8  ← total number of words in abusive posts

Page 31: Naive Bayes by Seo

Practice – abusive or not abusive

p1Vect = p1Num/p1Denom
p0Vect = p0Num/p0Denom
return p0Vect, p1Vect, pAbusive

p0Vect = [1 3 1 1 1 1 1 1 0 0 1 1 0 0] / 12
       = [0.08 0.25 0.08 0.08 0.08 0.08 0.08 0.08 0.00 0.00 0.08 0.08 0.00 0.00]
       = P(w | notAbusive)

p1Vect = [0 0 0 0 0 0 2 1 1 2 0 0 1 1] / 8
       = [0.00 0.00 0.00 0.00 0.00 0.00 0.25 0.13 0.13 0.25 0.00 0.00 0.13 0.13]
       = P(w | abusive)

pAbusive = P(abusive) = 2/5

Page 32: Naive Bayes by Seo

The multiply-by-zero problem

P(w1 | 1) * P(w2 | 1) * P(w3 | 1) * … * P(wn | 1)

If even one factor in a product like the one above is 0, the entire result becomes 0.

Fix: do not initialize the counts to 0. Initializing every word count to 1 and each denominator to 2 is a form of Laplace (add-one) smoothing.

Practice – abusive or not abusive

Before:
p0Num = zeros(numWords)
p1Num = zeros(numWords)
p0Denom = 0.0; p1Denom = 0.0

After:
p0Num = ones(numWords)
p1Num = ones(numWords)
p0Denom = 2.0; p1Denom = 2.0

Page 33: Naive Bayes by Seo

Underflow

P(w1 | 1) * P(w2 | 1) * P(w3 | 1) * … * P(wn | 1)

Multiplying many values between 0 and 1 drives the product toward 0. Since a computer can represent numbers only within a limited range, the result can underflow past that range and come out wrong.

Fix: take the natural logarithm. Taking ln turns the product into a sum, and the ordering of results is preserved: f(x) and ln(f(x)) increase together.

ln(a * b) = ln(a) + ln(b)
ln(P(w1|1) * P(w2|1) * … * P(wn|1)) = ln(P(w1|1)) + ln(P(w2|1)) + … + ln(P(wn|1))

Practice – abusive or not abusive
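A tiny Python demonstration of the underflow and the log fix (a sketch; the probabilities are made up for illustration):

    import math

    probs = [0.01] * 200                      # 200 small per-word probabilities
    product = 1.0
    for p in probs:
        product *= p
    print(product)                            # 0.0 -> the product underflowed

    print(sum(math.log(p) for p in probs))    # -921.03..., finite and still comparable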

Page 34: Naive Bayes by Seo

Practice – abusive or not abusive

Before:
p1Vect = p1Num/p1Denom
p0Vect = p0Num/p0Denom

After:
p1Vect = log(p1Num/p1Denom)
p0Vect = log(p0Num/p0Denom)

With the smoothed counts:

p0Vect = ln([2 4 2 2 2 2 2 2 1 1 2 2 1 1] / 14)
       = [-1.95 -1.25 -1.95 -1.95 -1.95 -1.95 -1.95 -1.95 -2.64 -2.64 -1.95 -1.95 -2.64 -2.64]

p1Vect = ln([1 1 1 1 1 1 3 2 2 3 1 1 2 2] / 10)
       = [-2.30 -2.30 -2.30 -2.30 -2.30 -2.30 -1.20 -1.61 -1.61 -1.20 -2.30 -2.30 -1.61 -1.61]

Page 35: Naive Bayes by Seo

Improved Python Code

Practice – abusive or not abusive

from numpy import ones, log

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)   # change to ones()
    p0Denom = 2.0; p1Denom = 2.0                     # change to 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num / p1Denom)   # change to log()
    p0Vect = log(p0Num / p0Denom)   # change to log()
    return p0Vect, p1Vect, pAbusive

Page 36: Naive Bayes by Seo

Classify Function

Practice – abusive or not abusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # log-space: the products become sums
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

※ vec2Classify = the input data: the word vector of a newly posted message

Page 37: Naive Bayes by Seo

Input

Practice – abusive or not abusive

inX = input data = {"stupid", "dog", "garbage", "worthless"}

vec2Classify = {0,1,0,0,0,0,1,0,0,1,0,0,1,0}

      please dog flea problem maybe park stupid stop kill garbage link steak worthless homeless
inX:     0    1    0     0      0    0     1     0    0     1      0    0       1        0

Page 38: Naive Bayes by Seo

Practice – abusive or not abusive

p1 = sum(vec2Classify * p1Vec) + log(pClass1)
p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)

vec2Classify = [0 1 0 0 0 0 1 0 0 1 0 0 1 0]
p1Vec = [-2.30 -2.30 -2.30 -2.30 -2.30 -2.30 -1.20 -1.61 -1.61 -1.20 -2.30 -2.30 -1.61 -1.61]
pClass1 = P(abusive) = 2/5 = 0.4

vec2Classify * p1Vec = [0 -2.30 0 0 0 0 -1.20 0 0 -1.20 0 0 -1.61 0]

p1 = sum(vec2Classify * p1Vec) + log(pClass1)
   = -2.30 - 1.20 - 1.20 - 1.61 - 0.92
   = -7.2

Page 39: Naive Bayes by Seo

Practice – abusive or not abusive

p1 = sum(vec2Classify * p1Vec) + log(pClass1)
p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)

vec2Classify = [0 1 0 0 0 0 1 0 0 1 0 0 1 0]
p0Vec = [-1.95 -1.25 -1.95 -1.95 -1.95 -1.95 -1.95 -1.95 -2.64 -2.64 -1.95 -1.95 -2.64 -2.64]
pClass0 = P(notAbusive) = 1 - P(abusive) = 1 - 0.4 = 0.6

vec2Classify * p0Vec = [0 -1.25 0 0 0 0 -1.95 0 0 -2.64 0 0 -2.64 0]

p0 = sum(vec2Classify * p0Vec) + log(pClass0)
   = -1.25 - 1.95 - 2.64 - 2.64 - 0.51
   = -9.0

Page 40: Naive Bayes by Seo

Practice – abusive or not abusive

if p1 > p0:
    return 1
else:
    return 0

p1 = log-score for abusive = -7.2
p0 = log-score for notAbusive = -9.0
p1 > p0, so the post belongs to class 1.

∴ The post is abusive.
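Putting the pieces together, a self-contained Python sketch of the whole walkthrough (NumPy assumed; the classifyNB wrapper name is taken from the classify-function slide):

    from numpy import array, ones, log

    def trainNB0(trainMatrix, trainCategory):
        numTrainDocs = len(trainMatrix)
        numWords = len(trainMatrix[0])
        pAbusive = sum(trainCategory) / float(numTrainDocs)
        p0Num = ones(numWords); p1Num = ones(numWords)   # smoothed counts
        p0Denom = 2.0; p1Denom = 2.0
        for i in range(numTrainDocs):
            if trainCategory[i] == 1:
                p1Num += trainMatrix[i]; p1Denom += sum(trainMatrix[i])
            else:
                p0Num += trainMatrix[i]; p0Denom += sum(trainMatrix[i])
        return log(p0Num / p0Denom), log(p1Num / p1Denom), pAbusive

    def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
        p1 = sum(vec2Classify * p1Vec) + log(pClass1)
        p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
        return 1 if p1 > p0 else 0

    # the training data from the slides
    trainingMatrix = array([[1,1,1,1,0,0,0,0,0,0,0,0,0,0],
                            [0,1,0,0,1,1,1,0,0,0,0,0,0,0],
                            [0,0,0,0,0,0,1,1,1,1,0,0,0,0],
                            [0,1,0,0,0,0,0,1,0,0,1,1,0,0],
                            [0,0,0,0,0,0,1,0,0,1,0,0,1,1]])
    trainCategory = [0, 0, 1, 0, 1]

    p0V, p1V, pAb = trainNB0(trainingMatrix, trainCategory)
    inX = array([0,1,0,0,0,0,1,0,0,1,0,0,1,0])   # {"stupid","dog","garbage","worthless"}
    print(classifyNB(inX, p0V, p1V, pAb))         # 1 -> abusive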

Page 41: Naive Bayes by Seo

Q & A

Page 42: Naive Bayes by Seo

Thank you.