GAN Study Notes (1)

[Notes on 李宏毅 (Hung-yi Lee)'s 2017 Deep Learning GAN course] Part 1

GAN: Generative Adversarial Network

Review: Auto-encoder

Input Image -> NN Encoder -> code -> NN Decoder -> Output Image
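
A minimal PyTorch sketch of the pipeline above (the framework choice, layer sizes, and the 784-dim input are my own illustrative assumptions, not from the course):

```python
import torch
import torch.nn as nn

# Minimal auto-encoder: Input Image -> Encoder -> code -> Decoder -> Output Image.
class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, code_dim),                 # the "code" (latent vector)
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),  # pixel values in [0, 1]
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code)

# One reconstruction training step (MSE between input and output image).
model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(16, 784)                  # stand-in for a batch of flattened images
loss = nn.functional.mse_loss(model(x), x)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```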

Basic Idea of GAN

Maximum Likelihood Estimation

  • Given a data distribution $P_{data}(x)$
  • We have a distribution $P_G(x;\theta)$ parameterized by $\theta$
    • E.g. $P_G(x;\theta)$ is a Gaussian Mixture Model, and $\theta$ are the means and variances of the Gaussians
    • We want to find $\theta$ such that $P_G(x;\theta)$ is close to $P_{data}(x)$

Sample $\{x^1,x^2,…,x^m\}$ from $P_{data}(x)$; for each sample we can compute $P_G(x^i;\theta)$.

Likelihood of generating the samples:

$L=\prod\limits_{i=1}^mP_{G}(x^i;\theta)$

Find $\theta^*$ maximizing the likelihood:

$\theta^*=arg\mathop{max}\limits_{\theta}\prod\limits_{i=1}^mP_{G}(x^i;\theta)=arg\mathop{max}\limits_{\theta}\sum\limits_{i=1}^mlogP_{G}(x^i;\theta)$

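As a small numerical illustration of this objective (my own toy example, not from the course): take $P_G(x;\theta)$ to be a single Gaussian with $\theta=(\mu,\sigma)$ and maximize the summed log-likelihood of the samples.

```python
import numpy as np
from scipy.stats import norm

# Toy MLE: P_G(x; theta) is a single Gaussian, theta = (mu, sigma).
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=0.5, size=1000)   # x^1..x^m drawn from P_data

def log_likelihood(mu, sigma, x):
    # sum_i log P_G(x^i; theta)
    return np.sum(norm.logpdf(x, loc=mu, scale=sigma))

# For a single Gaussian the maximizer has a closed form: sample mean and std.
mu_star, sigma_star = samples.mean(), samples.std()
print(mu_star, sigma_star)                     # close to the true (2.0, 0.5)
print(log_likelihood(mu_star, sigma_star, samples))
print(log_likelihood(0.0, 1.0, samples))       # any other theta gives a lower value
```
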
Now, instead of a Gaussian mixture, we use a neural network to define $P_G(x;\theta)$:

  • Generator G

    • G is a function: input $z$, output $x$.
    • Given a prior distribution $P_{prior}(z)$, a probability distribution $P_G(x)$ is defined by the function G.
  • Discriminator D

    • D is a function: input $x$, output a scalar.
    • It evaluates the “difference” between $P_G(x)$ and $P_{data}(x)$.
  • There is a function $V(G, D)$.

    $V = E_{x\sim P_{data}}[logD(x)] + E_{x\sim P_G}[log(1-D(x))]$

    Given a generator G, $\mathop{max}\limits_{D}V(G,D)$ evaluates the “difference” between $P_G$ and $P_{data}$. We then pick the G whose $P_G$ is most similar to $P_{data}$.

  • Given G, what is the optimal $D^*$ maximizing

    $V = E_{x\sim P_{data}}[logD(x)] + E_{x\sim P_G}[log(1-D(x))]$

    ​ $= \int\limits_{x}P_{data}(x)logD(x)dx + \int\limits_{x}P_G(x)log(1-D(x))dx$

    ​ $=\int\limits_{x}[P_{data}(x)logD(x) + P_G(x)log(1-D(x))]dx$

    Assume that $D(x)$ can have any value here

  • Given $x$, the optimal $D^*$ maximizing

    $\underline{P_{data}(x)}log\underline{D(x)} + \underline{P_G(x)}log(1-\underline{D(x)})$

    We denote $P_{data}(x)=a, D(x)=D, P_G(x)=b$

  • Find $D^*$ maximizing: $f(D)=a\times log(D)+b\times log(1-D)$

    $\frac{df(D)}{dD}=a\times\frac{1}{D} + b\times\frac{1}{1-D}\times(-1)=0$

    $\Rightarrow a \times \frac{1}{D^*}=b\times\frac{1}{1-D^*}$

    $\Rightarrow D^*=\frac{a}{a+b}=\frac{P_{data}(x)}{P_{data}(x)+P_G(x)}$

    $0<D^*<1$

  • $\mathop{max}\limits_{D}V(G,D)=V(G, D^*)$

    $=E_{x\sim P_{data}}[log\frac{P_{data}(x)}{P_{data}(x)+P_G(x)}]+E_{x\sim P_G}[log\frac{P_G(x)}{P_{data}(x)+P_G(x)}]$

    $=\int\limits_{x}P_{data}(x)log\frac{P_{data}(x)}{P_{data}(x)+P_G(x)}dx + \int\limits_{x}P_G(x)log\frac{P_G(x)}{P_{data}(x)+P_G(x)}dx$

    $=-2log2+\int\limits_{x}P_{data}(x)log\frac{P_{data}(x)}{(P_{data}(x)+P_G(x))/2}dx + \int\limits_{x}P_G(x)log\frac{P_G(x)}{(P_{data}(x)+P_G(x))/2}dx$

    $=-2log2 + KL(P_{data}(x)||\frac{P_{data}(x)+P_G(x)}{2})+KL(P_G(x)||\frac{P_{data}(x)+P_G(x)}{2})$

    In the end …

  • Generator G, Discriminator D

  • Looking for $G^*$ such that

    $G^*=arg\mathop{min}\limits_{G}\mathop{max}\limits_{D}V(G,D)$

  • Given G, $\mathop{max}\limits_DV(G,D)=-2log2+2JSD(P_{data}(x)||P_G(x))$ (this identity is checked numerically in the sketch after this list)

  • What is the optimal G?

    $P_G(x)=P_{data}(x)$
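
The derivation above can be checked numerically. The sketch below (my own check, not from the course) uses two toy discrete distributions, computes $D^*(x)=\frac{P_{data}(x)}{P_{data}(x)+P_G(x)}$, and verifies that $V(G,D^*)=-2log2+2JSD(P_{data}||P_G)$:

```python
import numpy as np

# Two toy discrete distributions over 4 outcomes.
p_data = np.array([0.4, 0.3, 0.2, 0.1])
p_g    = np.array([0.1, 0.2, 0.3, 0.4])

# Optimal discriminator D*(x) = P_data(x) / (P_data(x) + P_G(x)).
d_star = p_data / (p_data + p_g)

# V(G, D*) = E_{x~P_data}[log D*(x)] + E_{x~P_G}[log(1 - D*(x))].
v_at_d_star = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

# JSD(P_data || P_G) with the mixture M = (P_data + P_G) / 2.
m = 0.5 * (p_data + p_g)
kl = lambda p, q: np.sum(p * np.log(p / q))
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

print(v_at_d_star)                     # max_D V(G, D)
print(-2 * np.log(2) + 2 * jsd)        # identical up to floating-point error
```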

Algorithm

  • To find the best G, we minimize the loss function $L(G)=\mathop{max}\limits_{D}V(G,D)$ by gradient descent on $\theta_G$ (D is held fixed when computing the gradient); a toy sketch of this alternating procedure follows the list below.

    $\theta_{G}\gets\theta_{G}-\eta\ \partial L(G)/ \partial \theta_{G}$ , where $\theta_{G}$ defines G

  • Given $G_0$

  • Find $D_0^* $ maximizing $V(G_0, D)$

​ $V(G_0,D_0^*)$ is the JS divergence between $P_{data}(x)$ and $P_{G_0}(x)$

  • $\theta_G \gets \theta_G - \eta \partial V(G, D_0^*)/\partial \theta_G\ \Rightarrow$ Obtain $G_1$

  • Find $D_1^*$ maximizing $V(G_1,D)$

$V(G_1,D_1^*)$ is the JS divergence between $P_{data}(x)$ and $P_{G_1}(x)$

  • $\theta_G \gets \theta_G - \eta \partial V(G, D_1^*)/\partial \theta_G\ \Rightarrow$ Obtain $G_2$
  • … …
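
A toy sketch of this alternating procedure (my own illustration, not from the course): the generator distribution is a one-parameter Bernoulli, $\mathop{max}\limits_DV(G,D)$ is evaluated in closed form as $-2log2+2JSD(P_{data}||P_G)$, and gradient descent on $\theta_G$ drives $P_G$ toward $P_{data}$.

```python
import numpy as np

# P_data and P_G are Bernoulli distributions; theta is P_G(x=1).
p_data = np.array([0.3, 0.7])

def p_g(theta):
    return np.array([1.0 - theta, theta])

def loss(theta):
    # L(G) = max_D V(G, D) = -2 log 2 + 2 JSD(P_data || P_G), in closed form.
    pg = p_g(theta)
    m = 0.5 * (p_data + pg)
    kl = lambda p, q: np.sum(p * np.log(p / q))
    return -2 * np.log(2) + kl(p_data, m) + kl(pg, m)

theta, eta, eps = 0.1, 0.1, 1e-5
for _ in range(200):
    grad = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)  # numerical gradient
    theta -= eta * grad                                         # theta_G update
print(theta)   # approaches 0.7, i.e. P_G(x) = P_data(x)
```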

In practice

  • Given G, how do we compute $\mathop{max}\limits_{D}V(G,D)$?

    • Sample $\{x^1, x^2, …, x^m\}$ from $P_{data}(x)$, and sample $\{\widetilde{x}^1, \widetilde{x}^2,…,\widetilde{x}^m\}$ from the generator's distribution $P_G(x)$

    • Maximize $\widetilde{V}=\frac{1}{m}\sum\limits_{i=1}^{m}logD(x^i) + \frac{1}{m}\sum\limits_{i=1}^{m}log(1-D(\widetilde{x}^i))$


    Binary Classifier

    The output is $D(x)$; training D minimizes the cross-entropy loss:

    If $x$ is a positive example $\Rightarrow$ Minimize $-logD(x)$

    If $x$ is a negative example $\Rightarrow$ Minimize $-log(1-D(x))$


    D is a binary classifier (can be deep) with parameters $\theta_d$

    $\{x^1, x^2, …, x^m\}$ from $P_{data}(x)$ $\Rightarrow$ Positive examples

    $\{\widetilde{x}^1, \widetilde{x}^2, …, \widetilde{x}^m\}$ from $P_{G}(x)$ $\Rightarrow$ Negative examples

    Minimizing this classifier's cross-entropy loss $L=-\widetilde{V}$ is equivalent to maximizing $\widetilde{V}$.

  • In each training iteration (a code sketch of one full iteration follows the two lists below):


Learning D, Repeat k times

  • Sample m examples {$x^1, x^2, …, x^m$} from data distribution $P_{data}(x)$
  • Sample m noise samples {$z^1, z^2, …, z^m$} from the prior $P_{prior}(z)$
  • Obtain generated data {$\widetilde{x}^1,\widetilde{x}^2,…,\widetilde{x}^m$}, $\widetilde{x}^i=G(z^i)$
  • Update discriminator parameters $\theta_d$ to maximize
    • $\widetilde{V}=\frac{1}{m}\sum\limits_{i=1}^{m}logD(x^i) + \frac{1}{m}\sum\limits_{i=1}^{m}log(1-D(\widetilde{x}^i))$
    • $\theta_d \leftarrow \theta_d + \eta \triangledown \widetilde{V}(\theta_d)$ (gradient ascent)

Learning G, repeat only once

  • Sample another m noise samples {$z^1, z^2, …,z^m$} from the prior $P_{prior}(z)$

  • Update generator parameters $\theta_g$ to minimize

    • $\widetilde{V}=\frac{1}{m}\sum\limits_{i=1}^{m}logD(x^i) + \frac{1}{m}\sum\limits_{i=1}^{m}log(1-\underline{D(G(z^i))})$

    • $\theta_{g}\leftarrow\theta_g-\eta \triangledown\widetilde{V}(\theta_g)$ (gradient descent)
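
A minimal PyTorch-style sketch of one full training iteration as described in the two lists above (the networks, dimensions, learning rates, and toy data distribution are my own illustrative assumptions). Maximizing $\widetilde{V}$ for D is implemented as minimizing binary cross-entropy with real samples labeled 1 and generated samples labeled 0:

```python
import torch
import torch.nn as nn

# Illustrative toy networks and data; sizes are stand-ins, not from the course.
z_dim, x_dim, m = 8, 2, 64
G = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
bce = nn.BCELoss()   # minimizing BCE for D = maximizing V~

def sample_data(m):
    # Stand-in for sampling from P_data(x); here a fixed Gaussian blob.
    return torch.randn(m, x_dim) + torch.tensor([2.0, 2.0])

for _ in range(1000):
    # ---- Learning D, repeat k times ----
    for _ in range(1):                       # k = 1 here
        x = sample_data(m)                   # x^i ~ P_data
        z = torch.randn(m, z_dim)            # z^i ~ P_prior
        x_tilde = G(z).detach()              # x~^i = G(z^i), no gradient into G
        # max V~  <=>  min BCE(D(x), 1) + BCE(D(x~), 0)
        loss_d = bce(D(x), torch.ones(m, 1)) + bce(D(x_tilde), torch.zeros(m, 1))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # ---- Learning G, only once ----
    z = torch.randn(m, z_dim)                # another m noise samples
    # minimize log(1 - D(G(z))); the first term of V~ does not depend on theta_g
    loss_g = torch.log(1.0 - D(G(z)) + 1e-8).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```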

Real Implementation

$V=E_{x\sim P_{data}}[logD(x)]+\underline{E_{x\sim P_G}[log(1-D(x))]}$: the underlined term gives the generator only a small gradient at the beginning of training, when $D(x)$ is close to 0 for generated samples.

$V=E_{x \sim P_G}[-log(D(x))]$

In the real implementation, $x$ sampled from $P_G$ is simply labeled as positive when updating G, which yields the $-log(D(x))$ objective above.
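
A minimal sketch of this swap (same illustrative toy setup as the training-loop sketch above; only the generator objective changes):

```python
import torch
import torch.nn as nn

# Same illustrative toy setup as before (assumed sizes, not from the course).
z_dim, x_dim, m = 8, 2, 64
G = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
bce = nn.BCELoss()

z = torch.randn(m, z_dim)

# Original generator objective, E[log(1 - D(G(z)))]: its gradient is small
# when D(G(z)) is close to 0, which is exactly the situation early in training.
# loss_g = torch.log(1.0 - D(G(z)) + 1e-8).mean()

# Real implementation, E[-log(D(G(z)))]: larger gradient early on. This is the
# same as labeling the generated samples as positive (label 1) in the BCE loss.
loss_g = bce(D(G(z)), torch.ones(m, 1))
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```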