数学特別講義：機械学習の数理 (note-00)...-bxpg-bshf/vncfst o bwfsbhf 25/42...

数学特別講義：機械学習の数理 (note-00)

• 担当：金森敬文 (e-mail: [email protected])

http://www.kana-lab.c.titech.ac.jp/2018osaka.html

• 講義の予定：詳細は講義サイトを参照のこと．

∗ 線形回帰モデルとカーネル法：再生核ヒルベルト空間∗ 機械学習の問題設定：予測誤差，学習誤差，ベイズルール∗ 予測誤差の評価：確率不等式，Rademacher複雑度∗ 学習アルゴリズム：サポートベクトルマシンの統計的一致性

• 成績評価

∗ 講義中に出題するレポートの結果を総合して評価する．

1/42

• 参考文献 (論文はgoogleなどでタイトルを検索すればpdfを入手可)

∗ 講義全般：・ [書籍] Shai Shalev-Shwartz, et al., Understanding Machine Learning:

From Theory to Algorithms, Cambridge University Press, 2014. (ウェブからDL可能)

・ [サーベイ論文] O. Bousquet, S. Boucheron, G. Lugosi, “Introduction

to Statistical Learning Theory”, 2004.

・ [書籍] Mohri, et al., Foundations of Machine Learning, MIT Press,

2012．

2/42

— 機械学習の枠組—

3/42

観測されたデータ =⇒ 有用な情報を取り出す

• 講義では主に回帰分析・判別分析を扱う：データ (x1, y1), . . . , (xn, yn) から x と y の間の関係を推定する．

• その他の問題設定

∗ 次元削減：高次元データを，情報量を保ちつつ低次元に圧縮．∗ クラスタリング：データをいくつかのグループに分ける．

4/42

統計的データ解析と確率論

• 観測データは複雑（ノイズの影響など）

• 確率的なモデリングが有効

モデリング = [確定した構造] + [ランダムな構造]

• 確率論を基礎にして，データを解析する

∗ ただし本講義では，厳密な測度論的な取り扱いはしない

5/42

問題設定

x を入力すると y が出力される：

例： x −→ ?? −→ y

• データ (x1, y1), . . . , (xn, yn) が観測されている．

• 新たな入力 x に対する出力 y を予測

判別 (classification)：y が有限集合に値をとる．

回帰 (regression)：y は実数値をとる．

−4 −2 0 2 4

−4−2

02

4

C=1

X1

X2

6/42

機械学習の応用例

機械学習：データからパターンを発見，将来の予測などに役立てる．

例：文字認識，画像認識，音声認識など．

• スパム（spam:迷惑メール）フィルター

∗ x：メール（テキストデータ），y ∈ spam, non-spam

∗ サンプル：沢山の（過去の）スパムメールと普通のメール

∗ 将来のメールがスパムメールかどうかを判定して，仕分けする

7/42

• 文字認識：画像から文字を識別する．例：x ∈ R64, y ∈ 0, 1, 2, . . . , 9.

2 4 6 8

24

68

→ 0,2 4 6 8

24

68

→ 3,2 4 6 8

24

68

→ 7 or 9?

郵便番号の読み取りなどに実用化されている．

• 顔検出：顔がある位置を検出．デジカメなどに搭載．

148 Viola and Jones

5. Results

This section describes the final face detection system.The discussion includes details on the structure andtraining of the cascaded detector as well as results ona large real-world testing set.

5.1. Training Dataset

The face training set consisted of 4916 hand labeledfaces scaled and aligned to a base resolution of 24 by24 pixels. The faces were extracted from images down-loaded during a random crawl of the World Wide Web.Some typical face examples are shown in Fig. 8. Thetraining faces are only roughly aligned. This was doneby having a person place a bounding box around eachface just above the eyebrows and about half-way be-tween the mouth and the chin. This bounding box wasthen enlarged by 50% and then cropped and scaled to24 by 24 pixels. No further alignment was done (i.e.the eyes are not aligned). Notice that these examplescontain more of the head than the examples used by

Figure 8. Example of frontal upright face images used for training.

Rowley et al. (1998) or Sung and Poggio (1998). Ini-tial experiments also used 16 by 16 pixel training im-ages in which the faces were more tightly cropped,but got slightly worse results. Presumably the 24 by24 examples include extra visual information such asthe contours of the chin and cheeks and the hair linewhich help to improve accuracy. Because of the natureof the features used, the larger sized sub-windows donot slow performance. In fact, the additional informa-tion contained in the larger sub-windows can be usedto reject non-faces earlier in the detection cascade.

5.2. Structure of the Detector Cascade

The final detector is a 38 layer cascade of classifierswhich included a total of 6060 features.

The first classifier in the cascade is constructed us-ing two features and rejects about 50% of non-faceswhile correctly detecting close to 100% of faces. Thenext classifier has ten features and rejects 80% of non-faces while detecting almost 100% of faces. The nexttwo layers are 25-feature classifiers followed by three50-feature classifiers followed by classifiers with a

152 Viola and Jones

Figure 10. Output of our face detector on a number of test images from the MIT + CMU test set.

6. Conclusions

We have presented an approach for face detectionwhich minimizes computation time while achievinghigh detection accuracy. The approach was used to con-struct a face detection system which is approximately15 times faster than any previous approach. Preliminaryexperiments, which will be described elsewhere, showthat highly efficient detectors for other objects, such aspedestrians or automobiles, can also be constructed inthis way.

This paper brings together new algorithms, represen-tations, and insights which are quite generic and maywell have broader application in computer vision andimage processing.

The first contribution is a new a technique for com-puting a rich set of image features using the integralimage. In order to achieve true scale invariance, almostall face detection systems must operate on multipleimage scales. The integral image, by eliminating theneed to compute a multi-scale image pyramid, reducesthe initial image processing required for face detection

学習データ（正例）検出結果

P. Viola & M. J. Jones, Robust Real-Time Face Detection Journal International Journal of Computer Vision, 2004

8/42

— 確率論の復習 —

9/42

• 確率変数(r.v.)：ランダムな値をとる変数 X（通常は大文字で書く）

• 標本空間 Ω に値をとる確率変数 X に関する確率 P ( · ) の定義：

1. 事象 A ⊂ Ω に対して 0 ≤ P (X ∈ A) ≤ 1．2. 全事象 Ω の確率は 1．P (X ∈ Ω) = 1

3. 互いに排反な事象 Ai, i = 1, 2, 3, . . .

に対して

P (X ∈ ∪iAi) =∑i

P (X ∈ Ai).

（互いに排反： i = j に対して Ai ∩Aj = ϕ）

簡単のため P (X ∈ A) を P (A), P (A) と表すこともある．

!

A1 A2

10/42

確率の計算公理（だけ）から確認できる

• P (A) + P (Ac) = 1, Ac : Aの補集合

• 単調性： A ⊂ B ⊂ Ω =⇒ P (A) ≤ P (B)．A ⊂ B のとき B = A ∪ (B ∩Ac) （互いに排反）．∴ P (B) = P (A) + P (B ∩Ac) ≥ P (A).

• 加法定理：P (A ∪B) = P (A) + P (B)− P (A ∩B),

P (∪iAi) ≤∑

iP (Ai).

11/42

連続値をとる確率変数 (例：正規分布)

確率変数 X が 1次元正規分布にしたがう：

P (a ≤ X ≤ b) =

∫ b

a

1√2πσ2

e−(x−µ)2

2σ2 dx, (Ω = R)

このとき X ∼ N(µ, σ2) と表す．

−3 −2 −1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

density

−4 −2 0 2 4 6 8

0.0

0.1

0.2

0.3

0.4

normal density

x

p(x)

N(0,1)

N(2,4)

の面積 = P (1 ≤ X ≤ 2)

12/42

確率密度関数

n個の確率変数：X1, X2, . . . , Xn．集合 A ⊂ Rn．X = (X1, X2, . . . , Xn) が集合 A に含まれる確率が次式で定まるとする：

P (X ∈ A) =

∫A

p(x1, . . . , xn)dx1 · · · dxn

• (同時)確率密度関数：p(x) = p(x1, . . . , xn) ≥ 0,∫Rn

p(x)dx = 1. (略して確率密度，密度)

13/42

• 周辺密度：p1(x1) =

∫Rn−1

p(x1, . . . , xn)dx2 · · · dxn など．

離散確率変数(Aが加算集合)のとき∫A

· · · dx を∑x∈A

· · · に置き換える．

14/42

確率変数の期待値・分散

X = (X1, . . . , Xn) の密度を p(x1, . . . , xn) とする．

• Xi の期待値：「散らばりの中心」を表す．

E[Xi] =

∫Rd

xi p(x1, . . . , xd)dx1 · · · dxd =

∫Rd

xi p(x)dx ∈ R

• X = (X1, . . . , Xd)T の期待値： E[X] =

E[X1]

...

E[Xd]

∈ Rd

• 1次元確率変数 Xi の分散：「散らばりの大きさ」を表す．

V[Xi]定義= E[(Xi − E[Xi])

2] = E[X2i ]− E[Xi]

2

15/42

独立性，独立同一分布

n個の確率変数：X1, X2, . . . , Xn．

• X1, . . . , Xnが独立 ⇐⇒ 同時密度関数が積に分解

p(x1, . . . , xn) = p1(x1)p2(x2) · · · pn(xn)

• X1, . . . , Xnが独立に同一の分布にしたがう：

p(x1, . . . , xn) = q(x1)q(x2) · · · q(xn), (q = p1 = · · · = pn)

16/42

X,Y は独立のとき以下の等式が成立：

E[XY ] = E[X]E[Y ],

V[X + Y ] = V[X] + V[Y ] note: E[X + Y ] = E[X] + E[Y ] は常に成立．

17/42

• X1, . . . , Xn が独立に同一の分布 P にしたがうとき：

X1, . . . , Xn ∼i.i.d. P または (X1, . . . , Xn) ∼ Pn と書く．

(例えば X1, . . . , Xn ∼ N(0, 1) と書く)

このとき．X1, . . . , Xn の期待値や分散は全て等しい：

E[X1] = · · · = E[Xn], V[X1] = · · · = V[Xn].

• X1, . . . , Xn ∼i.i.d. P，µ = E[Xi], σ2 = V [Xi] のとき

Y =1

n

n∑i=1

Xi =⇒ E[Y ] = µ, V[Y ] =σ2

n

note: E[aX + b] = aE[X] + b, V[aX + b] = a2V[X]

18/42

条件付き確率・条件付き確率密度

• X,Y に関する確率 P (X ∈ A, Y ∈ B) が与えられているとき「X ∈ A の条件のもとで Y ∈ B」となる確率：

P (Y ∈ B | X ∈ A) =P (X ∈ A, Y ∈ B)

P (X ∈ A)

X ∈ A

Y ∈ B

19/42

• 密度 p(x, y) のもとで，x が与えられたときの y の条件付き密度

p(y|x) := p(x, y)∫p(x, y)dy

=p(x, y)

pX(x)(pX(x)：x の周辺密度)

「X ∈ [x, x+ dx] の条件下で Y ∈ [y, y + dy] となる確率」

=P (X ∈ [x, x+ dx], Y ∈ [y, y + dy])

P (X ∈ [x, x+ dx])

=p(x, y)dxdy

pX(x)dx= p(y|x)dy

性質： ∀x, y, p(y|x) ≥ 0,

∫p(y|x)dy = 1．

20/42

混合分布：条件付き確率密度の例

X は R 上の確率変数，Y は 0, 1 上の確率変数．

P (Y = 0) = q, P (Y = 1) = 1− q,

Xの条件付き密度 : p(x|Y = 0) = p0(x), p(x|Y = 1) = p1(x)

このとき X の周辺密度 p(x) は

p(x) =∑y=0,1

p(x, y) = q · p0(x) + (1− q) · p1(x). (p0とp1の混合分布)

22/42

漸近理論：大数の法則

X1, . . . , Xn ∼i.i.d . P として E(Xi) = µ とおく．

• 大数の法則：Xn定義=

1

n

n∑i=1

Xi とすると

∀ε > 0, limn→∞

P (|Xn − µ| > ε) = 0

∗ nが十分大きいと，高い確率で Xn は µ にほぼ等しい．

定義：確率変数列 Znn∈N が a ∈ R に確率収束する

∀ε > 0, P (|Zn − a| > ε) → 0 (n → ∞)

24/42

• 大数の法則より Xnp−→ µ．

• f(x) を連続関数とするとき，普通の極限と類似の関係が成立．

Xnp−→ µ ならば f(Xn)

p−→ f(µ)

Example 1 (大数の法則の例).

表の確率が0.7のコイン．n回振って表が出た割合をプロット．

0 200 400 600 800

0.5

0.6

0.7

0.8

0.9

1.0

Law of Large Numbers

n

average

25/42

確率不等式

統計的学習理論で使う不等式

• Jensen’s inequality

• Markov’s inequality

• Hoeffding’s lemma, Hoeffding’s inequality

• Massart’s lemma

• Azuma’s inequlity, McDiarmid inequality: Hoeffding’s の一般化

26/42

Jensen’s ineq.

凸関数の定義：

f(αx+ (1− α)y) ≤ αf(x) + (1− α)f(y), x,y ∈ Rd, α ∈ [0, 1].

Theorem 1 (Jensen’s ineq.).凸関数f : Rd → R，d次元確率変数 X に対して次式が成立：

f(E[X]) ≤ E[f(X)]

27/42

Markov/Chebyshev 不等式

• r.v. X ≥ 0 に対して

∀ε > 0, P (X ≥ ε) ≤ E[X]

ε

• Chebyshev 不等式

∀ε > 0, P (|X − E[X]| ≥ ε) ≤ V[X]

ε2

28/42

Lemma 1 (Hoeffding’s lemma).

確率変数Xが E[X] = 0, a ≤ X ≤ b を満たすとき，任意の t > 0 に対して

E[etX] ≤ et2(a−b)2/8.

以下の関係を使う．

ℓ(x) = αx+ β

⇒ E[ℓ(X)] = β

= pe(1−p)u+(1−p)e−pu

29/42

Proof. t = 1の場合を示す．

p ∈ [0, 1], u ≥ 0として a = −pu, b = (1 − p)uとする．区間 [a, b]上でex ≤ ℓ(x)，かつ E[X] = 0 のとき

E[eX] ≤ E[ℓ(X)] = pe(1−p)u + (1− p)e−pu.

以下で

pe(1−p)u + (1− p)e−pu ≤ eu2/8, p ∈ [0, 1], u ≥ 0

を示す．

pを固定して ϕ(u) = log[pe(1−p)u + (1− p)e−pu] とおくと

30/42

ϕ(0) = 0, ϕ′(0) = 0，

ϕ′′(u) =1− p

1− p+ peu· peu

1− p+ peu=

1− p

1− p+ peu·(1− 1− p

1− p+ peu

)≤ 1

4

となる．テイラーの定理より，0 ≤ v ≤ u となるvが存在して

ϕ(u) = ϕ(0) + uϕ′(0) +u2

2ϕ′′(v) ≤ u2

8.

31/42

Lemma 2 (Hoeffding’s inequality).確率変数X1, . . . , Xnは独立とし，Xi は確率 1 で有界区間 [ai, bi] に値をとるとする．S =

∑ni=1Xiとすると，任意

の ε > 0 に対して

P (S − E[S] ≥ ε) ≤ exp

− 2ε2∑n

i=1(bi − ai)2

,

P (S − E[S] ≤ −ε) ≤ exp

− 2ε2∑n

i=1(bi − ai)2

が成立．Remark .X1, . . . , Xn ∼i.i.d. P, E[Xi] = µ, a ≤ Xi ≤ b とすると

P

(∣∣∣∣1nn∑

i=1

Xi − µ

∣∣∣∣ ≥ ε

)≤ 2e−2nε2/(b−a)2

32/42

Proof of Hoeffding’s inequality. : 正値パラメータt > 0を導入すると

P (S − E[S] ≥ ε) = P (et(S−E[S]) ≥ etε)

≤ e−tεE[et(S−E[S])] (Markov’s ineq.)

= e−tεn∏

i=1

E[et(Xi−E[Xi])]

≤ e−tεet2∑n

i=1(bi−ai)2/8. (Hoeffding’s lemma)

33/42

上の不等式は任意のt > 0で成り立つので，上界が最小になるように

t = 4ε/n∑

i=1

(bi − ai)2 > 0

とおくと

P (S − E[S] ≥ ε) ≤ exp− 2ε2/

n∑i=1

(bi − ai)2

が得られます． 2番目の不等式も同様に得られます．

34/42

Lemma 3 (Massart’s lemma).

A ⊂ Rm, |A| < ∞，Rm上の2-ノルムを∥ · ∥として maxx∈A ∥x∥ ≤ r．σ1, . . . , σm ∼i.i.d. Ber(1/2) (つまり P (σi = +1) = P (σi = −1) = 1/2).

このとき

Eσ

[1

msupx∈A

m∑i=1

σixi

]≤

r√

2 log |A|m

, x = (x1, . . . , xm) ∈ Rm,

が成立．

cf. Rademacher複雑度

35/42

Proof. 任意の t > 0 に対して，指数関数の凸性とイェンセンの不等式より，

exp

Eσ

[t supx∈A

m∑i=1

σixi

]≤ Eσ

[exp

t supx∈A

m∑i=1

σixi

]

≤∑x∈A

Eσ

[exp

t

m∑i=1

σixi

]

=∑x∈A

m∏i=1

Eσi

[exp

tσixi

](1)

が成立(最後の等式はσ1, . . . , σmの独立性)．

Hoeffding’s lemma より

Eσi

[exp

tσixi

]≤ et

2(2xi)2/8.

36/42

よって (1) は |A|et2r2/2 で上から抑えられる．上界の対数をtで割ると

Eσ

[supx∈A

m∑i=1

σixi

]≤ tr2

2+

log |A|t

となり，t =

√2 log |A|r

を代入して

Eσ

[supx∈A

m∑i=1

σixi

]≤ r√

2 log |A|

が得られる．両辺を m で割る．

37/42

Lemma 4 (Azuma’s inequality).

確率変数Xi, Zi, Vi, i = 1, . . . , nに対して，ViはX1, . . . , Xiの関数として表され E[Vi|X1, . . . , Xi−1] = 0 とする．また Zi は X1, . . . , Xi−1 の関数であり，定数c1, . . . , cnが存在してZi ≤ Vi ≤ Zi+ ci が成り立つとする．このとき任意のε > 0に対して

Pr

(n∑

i=1

Vi ≥ ε

)≤ exp

− 2ε2∑n

i=1 c2i

,

Pr

(n∑

i=1

Vi ≤ −ε

)≤ exp

− 2ε2∑n

i=1 c2i

が成立．

38/42

Proof. 部分和∑k

i=1 ViをSkとおく．ヘフディングの不等式の証明と同様に，正値パラメータ tを導入して上界を求める．以下で t = 4ε/

∑ni=1 c

2i > 0と

おくと

Pr(Sn ≥ ε

)≤ e−tεE[etSn]

= e−tεEX1,...,Xn−1[etSn−1EXn[e

tVn|X1, . . . , Xn−1]]

≤ e−tεEX1,...,Xn−1[etSn−1]et

2c2n/8 (ヘフディングの補題)

≤ e−tεet2∑n

i=1 c2i/8

= e−2ε2/∑n

i=1 c2i

となる．同様にして2番目の不等式が得られる．

39/42

Lemma 5 (McDiarmid’s inequality).集合Xに値をとる独立な確率変数をX1, . . . , Xnとする．また関数f : Xn → Rに対して定数 c1, . . . , cnが存在して

|f(x1, . . . , xi−1, xi, xi+1, . . . , xn)− f(x1, . . . , xi−1, x′i, xi+1, . . . , xn)| ≤ ci,

i = 1, . . . , n

が任意のx1, . . . , xn, x′i ∈ Xに対して成り立つとする．このとき次式が成立：

Pr(f(X1, . . . , Xn)− E[f(X1, . . . , Xn)] ≥ ε

)≤ exp

− 2ε2∑n

i=1 c2i

,

Pr(f(X1, . . . , Xn)− E[f(X1, . . . , Xn)] ≤ −ε

)≤ exp

− 2ε2∑n

i=1 c2i

.

40/42

note: f(x) =n∑

i=1

xiとすると Hoeffding’s ineq. に一致．

Proof. f(X1, . . . , Xn)をf(S)と表す．またV1, . . . , Vnを

Vk = E[f(S)|X1, . . . , Xk]− E[f(S)|X1, . . . , Xk−1]

と定義する．ここでV1はE[f(S)|X1]− E[f(S)]を表す．以下で，このVkがLemma 4の仮定を満たすことを示す．

定義からVkはX1, . . . , Xkの関数して表せる．また条件付き期待値の性質から

E[Vk|X1, . . . , Xk−1] = 0

41/42

を満たす．さらに関数fに対する仮定より

supx

E[f(S)|X1, . . . , Xk−1, x]− infx′

E[f(S)|X1, . . . , Xk−1, x′]

= supx,x′

E[f(S)|X1, . . . , Xk−1, x]− E[f(S)|X1, . . . , Xk−1, x′] ≤ ck

となる．よって

Zk = infx

E[f(S)|X1, . . . , Xk−1, x]− E[f(S)|X1, . . . , Xk−1]

とすれば Xk, Vk, Zk, k = 1, . . . , nがLemma 4の仮定を満たすことが分かる．以上より，

∑ni=1 Vi = f(S) − E[f(S)] に対するAzuma’s ineq. から定理が

得られる．

42/42

数学特別講義：機械学習の数理 (note-00)...-bxpg-bshf/vncfst o bwfsbhf 25/42...

Documents