Deep learning study 1

Think of everything as a probability!
[Figure: distributions of the Iris features (sepal length, sepal width, petal length, ...) for the classes Setosa, Versicolor, and Virginica]

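Think of classification as a probability problem, too!
$P(\text{Setosa} \mid \boldsymbol{x}) = 0.1$
$P(\text{Versicolor} \mid \boldsymbol{x}) = 0.85$
$P(\text{Virginica} \mid \boldsymbol{x}) = 0.05$
Predicted class: $\arg\max_i P(\text{Class}_i \mid \boldsymbol{x})$
$\text{Class}_0 = \text{Setosa}$, $\text{Class}_1 = \text{Versicolor}$, $\text{Class}_2 = \text{Virginica}$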
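The decision rule above can be sketched in a few lines. The posterior values are the ones on the slide; the dictionary layout and variable names are illustrative choices, not part of the original material.

```python
# Posterior probabilities P(Class_i | x) for one input x, taken from the slide.
posteriors = {"Setosa": 0.10, "Versicolor": 0.85, "Virginica": 0.05}

# argmax over classes: pick the class with the largest posterior probability.
predicted = max(posteriors, key=posteriors.get)

print(predicted)
```

The classifier only needs the ranking of the posteriors, but keeping the full distribution also tells us how confident the prediction is.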

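Many tasks, including generation, can be expressed as probabilities.
• Generation (sampling): draw a face from the face distribution, $x \sim P_X(x)$.
• Classification (discrimination): model the gender (male or female) given a face, $P_{Y|X}$, where $P_Y$ is the gender distribution.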

The data we can actually obtain is far too scarce!
[Figure: sample space and event set]

The curse of dimensionality
(Adapted from https://bigsnarf.wordpress.com/2013/06/14/curse-of-dimensionality/)

Differences between conventional ML and deep learning
Adapted from Goodfellow, 2016

Super high dimensional function
Input → Neural Network → Desired output

Super high dimensional function
Input distribution: X → Neural Network → Desired output distribution: Y
$F(x) = P(y \mid x) = \frac{P(x \mid y)\,P(y)}{p(x)}$

Taxonomy of deep learning
Adapted from Goodfellow, 2016

Linear activation function
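[Figure: a network with an input layer, hidden layers 1 to 3, and an output layer, with weights $W_1$, $W_2$, $W_3$ between input and output]
$\text{Output} = \text{Input} \times W_1 \times W_2 \times W_3 = \text{Input} \times W$
[Figure: the equivalent network with an input layer, one hidden layer, and an output layer, with the single weight $W$]
With linear activations, stacked layers collapse into a single linear map $W = W_1 W_2 W_3$.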
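A small numerical check of this collapse, as a sketch in plain Python. The matrix values and the `matmul` helper are made up for illustration; any matrices with compatible shapes would behave the same way.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# Illustrative input (1x2) and weights (2x3, 3x2, 2x1); no activation function.
x = [[1, 2]]
w1 = [[1, 0, 2], [0, 1, -1]]
w2 = [[2, 1], [0, 1], [1, 0]]
w3 = [[1], [3]]

# "Deep" network: apply the three linear layers one after another.
deep = matmul(matmul(matmul(x, w1), w2), w3)

# "Shallow" network: pre-multiply the weights into a single matrix W.
w = matmul(matmul(w1, w2), w3)
shallow = matmul(x, w)

print(deep, shallow)
```

Both paths give the same output, which is why a nonlinear activation is needed for depth to add expressive power.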

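Characteristics of deep learning and conventional deep learning
• Deep learning is good at solving optimization problems.
• Once an objective function is specified, deep learning optimizes it and even performs the feature extraction needed to do so. This is possible because of the very complex network structure and high-performance optimizers.
Designing a conventional deep learning application
1. Treat the task to be learned as a probability distribution, and assume that it follows some specific distribution.
2. Define an objective function (for example, maximum likelihood) that approximates the assumed distribution, which turns the task into an optimization problem.
3. Design the deep neural network so that it extracts features well from the task's input (usually convolution when the input is an image) and so that it trains well (ReLU, residual connections, etc.).
4. Train with deep learning (update the network's weights with an optimizer).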
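The four design steps can be sketched on a toy problem. Everything concrete here is an assumption for illustration: the six data points, the single sigmoid unit standing in for a deep network, and plain gradient descent standing in for a real optimizer.

```python
import math

# Toy 1-D binary classification data (hypothetical values).
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Step 1: assume y | x follows a Bernoulli distribution with mean sigmoid(w*x + b).
# Step 2: objective = average negative log-likelihood of that Bernoulli model.
def nll(w, b):
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)

# Step 3: the "network" is one sigmoid unit with parameters w and b.
w, b, lr = 0.0, 0.0, 0.5
initial_loss = nll(w, b)

# Step 4: update the weights with gradient descent on the objective.
for _ in range(200):
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

final_loss = nll(w, b)
predictions = [1 if sigmoid(w * x + b) > 0.5 else 0 for x in xs]
print(initial_loss, final_loss, predictions)
```

A real application would replace the single unit with a deep architecture and the hand-written gradients with automatic differentiation, but the four steps stay the same.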

Bayes’s theorem
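$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
A: male or female
B: hair length
$P(B \mid A)$: the probability that a man's (or a woman's) hair is x cm long
$P(A \mid B)$: given a hair that is x cm long, the probability that it came from a man (or a woman)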

Bayes’s theorem
MLE vs. MAP
The difference is whether a prior (for example, 90% male and 10% female) is applied: MAP uses it, MLE does not.
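A minimal sketch of how the prior can change the decision. The 90%/10% prior is from the slide; the likelihood values for observing long hair are hypothetical numbers chosen only to make the effect visible.

```python
# Hypothetical likelihoods P(long hair | class); assumed values, not from the slides.
likelihood = {"male": 0.1, "female": 0.7}

# Prior from the slide: 90% male, 10% female.
prior = {"male": 0.9, "female": 0.1}

# MLE ignores the prior and picks the class with the largest likelihood.
mle_choice = max(likelihood, key=likelihood.get)

# MAP weights each likelihood by the prior (Bayes' theorem) before picking.
unnormalized = {c: likelihood[c] * prior[c] for c in prior}
evidence = sum(unnormalized.values())  # P(long hair)
posterior = {c: v / evidence for c, v in unnormalized.items()}
map_choice = max(posterior, key=posterior.get)

print(mle_choice, map_choice, posterior)
```

With these numbers MLE answers "female" while MAP answers "male", because the strong prior outweighs the likelihood ratio.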

Bayes’s theorem
Generating human face
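Distribution of the distance between the eyebrows
Only men have beards.
Distribution of skin color
Distribution of hair length
Men usually have larger faces than women.
Only men experience hair loss.
⋮
• For complex problems, it is hard to even know which priors exist.
• An incorrect prior may be supplied. (It can be updated if we are lucky enough to draw samples that reveal the error.)
• It is hard to obtain the exact distribution of a prior.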

Bayesian inference
$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$
H: hypothesis
E: evidence
$P(H)$: prior probability
$P(E \mid H)$: likelihood
$P(H \mid E)$: posterior probability

Bayesian inference
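$p(\theta \mid \boldsymbol{X}, \alpha) = \frac{p(\boldsymbol{X} \mid \theta)\,p(\theta \mid \alpha)}{p(\boldsymbol{X} \mid \alpha)} \propto p(\boldsymbol{X} \mid \theta)\,p(\theta \mid \alpha)$
$x$: a general data point
$\theta$: the parameter of the data point's distribution
$\alpha$: the hyperparameter of the parameter distribution
$\boldsymbol{X}$: sample data, a set of n observed data points, i.e., $x_1, \cdots, x_n$
$x$: a new data point whose distribution is to be predicted
$p(\theta \mid \alpha)$: prior distribution
$p(\boldsymbol{X} \mid \theta)$: sampling distribution
$p(\boldsymbol{X} \mid \alpha)$: marginal distribution
$p(\theta \mid \boldsymbol{X}, \alpha)$: posterior distribution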

Bayesian prediction
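Posterior predictive distribution
$p(x \mid \boldsymbol{X}, \alpha) = \int p(x \mid \theta)\,p(\theta \mid \boldsymbol{X}, \alpha)\,d\theta$
Prior predictive distribution
$p(x \mid \alpha) = \int p(x \mid \theta)\,p(\theta \mid \alpha)\,d\theta$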

Bayesian prediction
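Likelihood (if $\boldsymbol{X}$ is i.i.d.)
$p(\boldsymbol{X} \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$
Log-likelihood
$\log p(\boldsymbol{X} \mid \theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$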
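One practical reason to work with the log-likelihood is numerical: a product of many small probabilities underflows in floating point, while the sum of logs stays well behaved. The sketch below uses 2000 made-up per-sample probabilities of 0.5 to show this.

```python
import math

# Hypothetical per-sample likelihoods p(x_i | theta) for n = 2000 i.i.d. samples.
probabilities = [0.5] * 2000

# Likelihood: the product of per-sample terms underflows to 0.0 in double precision.
product = 1.0
for p in probabilities:
    product *= p

# Log-likelihood: the sum of per-sample log terms remains a finite number.
log_sum = sum(math.log(p) for p in probabilities)

print(product, log_sum)
```

Because the logarithm is monotonic, maximizing the log-likelihood gives the same parameter as maximizing the likelihood itself.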

Maximum likelihood estimation
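[Figure: estimates from 10 samples and from 50 samples]
Measuring height.
Ground truth: 1.78 m, Gaussian distribution, sigma: 0.1
Sigma is assumed to be known. The height is estimated as a point estimate.
(By the weak law of large numbers, the sample mean converges to the true mean; unbiased.)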
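A sketch of this experiment, assuming samples are drawn with Python's `random.gauss` and a fixed seed. For a Gaussian with known sigma, the maximum likelihood estimate of the mean is the sample mean; the function names and the seed are illustrative choices.

```python
import math
import random

def log_likelihood(samples, mu, sigma):
    """Gaussian log-likelihood of i.i.d. samples for a candidate mean mu."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in samples
    )

def mle_mean(samples):
    """With sigma known, the MLE of the Gaussian mean is the sample mean."""
    return sum(samples) / len(samples)

rng = random.Random(0)          # fixed seed so the run is repeatable
true_mu, sigma = 1.78, 0.1      # ground truth from the slide

estimates = {}
for n in (10, 50, 100):
    samples = [rng.gauss(true_mu, sigma) for _ in range(n)]
    estimates[n] = mle_mean(samples)

print(estimates)
```

The estimates fluctuate around 1.78 and the fluctuation shrinks roughly as sigma divided by the square root of n, which is what the figures with 10, 50, and 100 samples illustrate.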

Maximum likelihood estimation
[Figure: estimate from 100 samples]

Maximum a posteriori (MAP) estimation
[Figure: fine prior vs. wrong prior]
Parameter estimation for a coin flip with p = 0.5 (Bernoulli distribution)
Prior: beta distribution (the conjugate prior)

[Figure: no prior (maximum entropy) vs. strong prior]
Parameter estimation for a coin flip with p = 0.5 (Bernoulli distribution)
Prior: beta distribution (the conjugate prior)

[Figure: 1 sample per step vs. 3 samples per step]
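For a Bernoulli parameter with a Beta(a, b) prior, the posterior after observing h heads and t tails is Beta(a + h, b + t), and its mode gives the MAP estimate in closed form. The sketch below compares a uniform prior with a strong correct prior and a strong wrong prior; the counts and prior parameters are made-up values for illustration.

```python
def mle_estimate(heads, tails):
    """Maximum likelihood estimate of the heads probability."""
    return heads / (heads + tails)

def map_estimate(heads, tails, a, b):
    """Mode of the Beta(a + heads, b + tails) posterior."""
    post_a, post_b = a + heads, b + tails
    if post_a <= 1 or post_b <= 1:
        raise ValueError("mode formula needs both posterior parameters > 1")
    return (post_a - 1) / (post_a + post_b - 2)

# Hypothetical observation: 7 heads and 3 tails from a fair coin.
heads, tails = 7, 3

mle = mle_estimate(heads, tails)
no_prior = map_estimate(heads, tails, 1, 1)       # uniform Beta(1, 1): same as MLE
fine_prior = map_estimate(heads, tails, 50, 50)   # strong prior centered on 0.5
wrong_prior = map_estimate(heads, tails, 90, 10)  # strong prior centered on 0.9

print(mle, no_prior, fine_prior, wrong_prior)
```

With few samples, a good prior pulls the estimate toward the true value 0.5, while a wrong prior pulls it away and needs many more samples to be corrected.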

Exponential Family
• Normal
• Exponential
• Gamma
• Chi-squared
• Beta
• Dirichlet
• Bernoulli
• Categorical
• Poisson
• Wishart
• Inverse Wishart
• Geometric
• Binomial (with fixed number of trials)
• Multinomial (with fixed number of trials)
• Negative binomial (with fixed number of failures)

Exponential Family
$f_X(x \mid \theta) = h(x)\,\exp\big(\eta(\theta) \cdot T(x) - A(\theta)\big)$
where $T(x)$, $h(x)$, $\eta(\theta)$, and $A(\theta)$ are known functions.
The value 𝜃 is called the parameter of the family.

Exponential Family
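$f_X(x \mid \theta) = h(x)\,\exp\big(\eta(\theta) \cdot T(x) - A(\theta)\big)$
$P(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
$= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}\left(x^2 - 2x\mu + \mu^2\right)\right)$
$= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}x^2\right) \exp\left(\frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}\mu^2\right)$
• $h(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}x^2\right)$
• $\eta(\theta) = \frac{\theta}{\sigma^2}$
• $T(x) = x$
• $A(\theta) = \frac{1}{2\sigma^2}\theta^2$
• $\theta = \mu$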
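The decomposition can be checked numerically: evaluating $h(x)\exp(\eta(\theta) T(x) - A(\theta))$ with the four functions listed above should reproduce the usual Gaussian density. The function names and the test values below are illustrative.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Standard form of the Gaussian density."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def exp_family_pdf(x, theta, sigma):
    """Same density written as h(x) * exp(eta(theta) * T(x) - A(theta)), with theta = mu."""
    h = math.exp(-(x ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
    eta = theta / sigma ** 2
    t = x
    a = theta ** 2 / (2 * sigma ** 2)
    return h * math.exp(eta * t - a)

# Compare both forms at a few arbitrary points (mu = 1.0, sigma = 2.0 are made-up values).
for x in (-1.0, 0.0, 0.5, 3.0):
    print(x, gaussian_pdf(x, 1.0, 2.0), exp_family_pdf(x, 1.0, 2.0))
```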

Exponential Family
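Gaussian distribution: linear regression
• $\theta = \mu$, so $\mu = \theta$
Binomial distribution: sigmoid regression
• $\theta = \log\left(\frac{\phi}{1 - \phi}\right)$, so $\phi = \frac{1}{1 + e^{-\theta}}$
Multinomial distribution: softmax regression
• $\theta_k = \log\left(\frac{\pi_k}{\pi_K}\right)$, so $\pi_k = \frac{e^{\theta_k}}{\sum_{j=1}^{K} e^{\theta_j}}$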
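A short check that the sigmoid inverts the log-odds and that the softmax recovers the class probabilities from the log-ratios $\theta_k = \log(\pi_k / \pi_K)$. The probabilities used here are arbitrary example values, and the max-shift inside `softmax` is a standard numerical-stability trick rather than something from the slides.

```python
import math

def logit(phi):
    """Natural parameter of the Bernoulli: log-odds of phi."""
    return math.log(phi / (1.0 - phi))

def sigmoid(theta):
    """Inverse of the logit."""
    return 1.0 / (1.0 + math.exp(-theta))

def softmax(thetas):
    """Map natural parameters to class probabilities."""
    shift = max(thetas)  # subtracting the max does not change the result
    exps = [math.exp(t - shift) for t in thetas]
    total = sum(exps)
    return [e / total for e in exps]

# Arbitrary class probabilities; the last class K is the reference class.
pi = [0.2, 0.5, 0.3]
thetas = [math.log(p / pi[-1]) for p in pi]
recovered = softmax(thetas)

print(sigmoid(logit(0.3)), thetas, recovered)
```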

Parameter regularization
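MAP for $\theta$
$\arg\max_{\theta} P(y \mid x; \theta)\,P(\theta)$
$\theta_{MAP} = \arg\max_{\theta} \log\left[\left(\prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}; \theta)\right) P(\theta)\right]$
$= \arg\max_{\theta} \sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)}; \theta) + \log P(\theta)$
$\log P(\theta) = C_1 - C_2\theta^2$ (for example, with a zero-mean Gaussian prior)
Maximizing $\log P(\theta)$ is therefore the same as minimizing $C_2\theta^2$, which is an L2 penalty on the parameters.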
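A sketch of this equivalence for the simplest case: a one-parameter linear model $y = \theta x$ with Gaussian noise and a zero-mean Gaussian prior on $\theta$. Under those assumptions the MAP estimate minimizes squared error plus `lam * theta**2`, where `lam` is the ratio of noise variance to prior variance. The data and `lam = 1.0` are made-up values.

```python
# Toy regression data (hypothetical values).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.3]

def theta_mle(xs, ys):
    """Least squares solution: maximizes the Gaussian likelihood."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def theta_map(xs, ys, lam):
    """Closed-form minimizer of squared error + lam * theta**2 (Gaussian prior)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def penalized_loss(theta, xs, ys, lam):
    """Negative log-posterior up to constants and a positive scale factor."""
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys)) + lam * theta ** 2

lam = 1.0
mle = theta_mle(xs, ys)
map_ = theta_map(xs, ys, lam)

print(mle, map_)
```

The MAP estimate is shrunk toward zero compared with the MLE, which is the usual effect of L2 regularization (weight decay).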

Information theory
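Entropy
$H(p) = -\sum_i p_i \log p_i$
Cross entropy
$H(p, q) = -\sum_i p_i \log q_i = H(p) + D_{KL}(p \,\|\, q)$
Kullback-Leibler divergence
$D_{KL}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i} = -\sum_i p_i \log q_i - \left(-\sum_i p_i \log p_i\right) = H(p, q) - H(p)$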
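The three quantities and the identity $H(p, q) = H(p) + D_{KL}(p \| q)$ can be verified directly. The two distributions below are arbitrary examples, and terms with $p_i = 0$ are skipped following the usual convention $0 \log 0 = 0$.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Arbitrary example distributions: p is the target, q is the estimate.
p = [0.5, 0.5]
q = [0.9, 0.1]

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl = kl_divergence(p, q)

print(h_p, h_pq, kl)
```

Since $H(p)$ does not depend on $q$, minimizing the cross entropy with respect to $q$ is the same as minimizing $D_{KL}(p \| q)$.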

Kullback-Leibler divergence
P: target distribution
Q: estimated distribution
[Figure: forward KL divergence $D_{KL}(P \,\|\, Q)$ (MLE) vs. reverse KL divergence $D_{KL}(Q \,\|\, P)$]

Jensen-Shannon divergence
Mode collapsing
$\mathrm{JSD}(p \,\|\, q) = \frac{1}{2} D_{KL}(p \,\|\, m) + \frac{1}{2} D_{KL}(q \,\|\, m)$, where $m = \frac{1}{2}(p + q)$
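A small sketch of the Jensen-Shannon divergence built from the KL divergence. Unlike KL, it is symmetric and stays finite even when the two distributions do not overlap; with natural logarithms its maximum is $\log 2$. The example distributions are arbitrary.

```python
import math

def kl_divergence(p, q):
    # Terms with p_i = 0 contribute nothing (convention 0 * log 0 = 0).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture m = (p + q) / 2
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Arbitrary example distributions.
p = [0.5, 0.5]
q = [0.9, 0.1]

jsd_pq = js_divergence(p, q)
jsd_qp = js_divergence(q, p)
jsd_disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])

print(jsd_pq, jsd_qp, jsd_disjoint)
```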

Maximum Entropy Distribution
Continuous-valued output (regression): Gaussian distribution. Example: stock price movement.
Binary classification: binomial distribution. Example: male or female; true or false.
Multi-class classification: multinomial distribution. Example: lion, tiger, or chimpanzee.

Schedule
• Next session: deep learning (solver, batch, ensemble, etc.) or the basic statistics needed for machine learning
• Session after next: formulating problems for deep learning
