[Hung-yi Lee 2017 Deep Learning] GAN Lecture Notes, Part 1
GAN: Generative Adversarial Network
Review: Auto-encoder
Input Image -> NN Encoder -> code -> NN Decoder -> Output Image
Basic Idea of GAN
Maximum Likelihood Estimation
- Given a data distribution $P_{data}(x)$
- We have a distribution $P_G(x;\theta)$ parameterized by $\theta$
- E.g. $P_G(x;\theta)$ is a Gaussian Mixture Model, and $\theta$ are the means and variances of the Gaussians
- We want to find $\theta$ such that $P_G(x;\theta)$ is close to $P_{data}(x)$
Sample {$x^1,x^2,…,x^m$} from $P_{data}(x)$; we can compute $P_G(x^i;\theta)$
Likelihood of generating the samples:
$L=\prod\limits_{i=1}^{m}P_{G}(x^i;\theta)$
Find $\theta^*$ maximizing the likelihood:
$\theta^*=arg\mathop{max}\limits_{\theta}\prod\limits_{i=1}^{m}P_{G}(x^i;\theta)=arg\mathop{max}\limits_{\theta}\sum\limits_{i=1}^{m}logP_{G}(x^i;\theta)$
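As a sanity check on the maximum-likelihood idea, here is a minimal sketch in pure Python. The setup is an assumed toy one (not from the notes): $P_G$ is a unit-variance Gaussian whose only parameter $\theta$ is its mean, and we find the $\theta$ maximizing the log-likelihood by grid search; it recovers the sample mean, as expected.

```python
import math
import random

random.seed(0)

# Assumed toy setup: P_data = N(2, 1); P_G(x; theta) = N(x; theta, 1).
mu_true = 2.0
samples = [random.gauss(mu_true, 1.0) for _ in range(1000)]

def log_likelihood(theta):
    # sum_i log N(x^i; theta, 1)
    return sum(-0.5 * (x - theta) ** 2 - 0.5 * math.log(2 * math.pi)
               for x in samples)

# grid search over candidate means in [0, 4]
best_theta = max((t / 100 for t in range(0, 401)), key=log_likelihood)
print(best_theta)  # close to the sample mean of the data, i.e. ~2
```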
Now we utilize a neural network to output $P_G(x;\theta)$
Generator G
- G is a function, input $z$, output $x$.
- Given a prior distribution $P_{prior}(z)$, a probability distribution $P_G(x)$ is defined by the function G.
Discriminator D
- D is a function, input $x$, output scalar.
- Evaluate the “difference” between $P_G(x)$ and $P_{data}(x)$.
There is a function $V(G, D)$.
$V = E_{x\sim P_{data}}[logD(x)] + E_{x\sim P_G}[log(1-D(x))]$
Given a generator G, $\mathop{max}\limits_{D}V(G,D)$ evaluates the "difference" between $P_G$ and $P_{data}$. Pick the G whose $P_G$ is most similar to $P_{data}$.
Given G, what is the optimal $D^*$ maximizing
$V = E_{x\sim P_{data}}[logD(x)] + E_{x\sim P_G}[log(1-D(x))]$
$= \int\limits_{x}P_{data}(x)logD(x)dx + \int\limits_{x}P_G(x)log(1-D(x))dx$
$=\int\limits_{x}[P_{data}(x)logD(x) + P_G(x)log(1-D(x))]dx$
Assume that $D(x)$ can have any value here
Given $x$, the optimal $D^*$ maximizing
$\underline{P_{data}(x)}log\underline{D(x)} + \underline{P_G(x)}log(1-\underline{D(x)})$
We denote $P_{data}(x)=a, D(x)=D, P_G(x)=b$
Find $D^*$ maximizing: $f(D)=a\times log(D)+b\times log(1-D)$
$\frac{df(D)}{dD}=a\times\frac{1}{D} + b\times\frac{1}{1-D}\times(-1)=0$
$\Rightarrow a \times \frac{1}{D^*}=b\times\frac{1}{1-D^*}$
$\Rightarrow D^*=\frac{a}{a+b}=\frac{P_{data}(x)}{P_{data}(x)+P_G(x)}$
$0<D^*<1$
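The closed form $D^*=\frac{a}{a+b}$ can be checked numerically. A small sketch with assumed toy densities $a$ and $b$: scan $D$ over $(0,1)$ and confirm the maximizer of $f(D)$ matches the formula.

```python
import math

# Numeric check of the derivation above: for fixed a = P_data(x), b = P_G(x),
# f(D) = a*log(D) + b*log(1-D) is maximized at D* = a / (a + b).
a, b = 0.3, 0.7  # assumed toy density values at some point x

def f(D):
    return a * math.log(D) + b * math.log(1 - D)

# scan D over (0, 1) and find the maximizer
grid = [i / 10000 for i in range(1, 10000)]
D_best = max(grid, key=f)
print(D_best, a / (a + b))  # both 0.3
```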
$\mathop{max}\limits_{D}V(G,D)=V(G, D^*)$
$=E_{x\sim P_{data}}[log\frac{P_{data}(x)}{P_{data}(x)+P_G(x)}]+E_{x\sim P_G}[log\frac{P_G(x)}{P_{data}(x)+P_G(x)}]$
$=\int\limits_{x}P_{data}(x)log\frac{P_{data}(x)}{P_{data}(x)+P_G(x)}dx + \int\limits_{x}P_G(x)log\frac{P_G(x)}{P_{data}(x)+P_G(x)}dx$
$=-2log2+\int\limits_{x}P_{data}(x)log\frac{P_{data}(x)}{(P_{data}(x)+P_G(x))/2}dx + \int\limits_{x}P_G(x)log\frac{P_G(x)}{(P_{data}(x)+P_G(x))/2}dx$
$=-2log2 + KL(P_{data}(x)||\frac{P_{data}(x)+P_G(x)}{2})+KL(P_G(x)||\frac{P_{data}(x)+P_G(x)}{2})$
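The identity $\mathop{max}\limits_{D}V(G,D)=-2log2+2JSD(P_{data}||P_G)$ can also be verified numerically. A minimal sketch on two assumed toy discrete distributions, plugging the optimal $D^*(x)=\frac{P_{data}(x)}{P_{data}(x)+P_G(x)}$ into $V$:

```python
import math

# Verify on discrete distributions that V(G, D*) = -2*log(2) + 2*JSD,
# where JSD(P||Q) = (1/2) KL(P||M) + (1/2) KL(Q||M) with M = (P+Q)/2.
p_data = [0.5, 0.3, 0.2]   # assumed toy distributions
p_g    = [0.2, 0.2, 0.6]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

m = [(pd + pg) / 2 for pd, pg in zip(p_data, p_g)]
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

# V at the optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_g(x))
V = sum(pd * math.log(pd / (pd + pg)) + pg * math.log(pg / (pd + pg))
        for pd, pg in zip(p_data, p_g))

print(abs(V - (-2 * math.log(2) + 2 * jsd)))  # ~0
```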
In the end …
Generator G, Discriminator D
Looking for $G^*$ such that
$G^*=arg\mathop{min}\limits_{G}\mathop{max}\limits_{D}V(G,D)$
Given G, $\mathop{max}\limits_DV(G,D)=-2log2+2JSD(P_{data}(x)||P_G(x))$
What is the optimal G?
$P_G(x)=P_{data}(x)$
Algorithm
- To find the best G minimizing the loss function $L(G)=\mathop{max}\limits_{D}V(G,D)$ (here D is fixed):
- $\theta_{G}\gets\theta_{G}-\eta\,\partial L(G)/ \partial \theta_{G}$, where $\theta_{G}$ are the parameters defining G
Given $G_0$:
- Find $D_0^*$ maximizing $V(G_0, D)$; $V(G_0,D_0^*)$ is the JS divergence between $P_{data}(x)$ and $P_{G_0}(x)$
- $\theta_G \gets \theta_G - \eta\,\partial V(G, D_0^*)/\partial \theta_G \Rightarrow$ obtain $G_1$
- Find $D_1^*$ maximizing $V(G_1,D)$; $V(G_1,D_1^*)$ is the JS divergence between $P_{data}(x)$ and $P_{G_1}(x)$
- $\theta_G \gets \theta_G - \eta\,\partial V(G, D_1^*)/\partial \theta_G \Rightarrow$ obtain $G_2$
- … …
In practice
Given G, how to compute $\mathop{max}\limits_{D}V(G,D)$?
Sample {$x^1, x^2, …, x^m$} from $P_{data}(x)$; sample {$\widetilde{x}^1, \widetilde{x}^2,…,\widetilde{x}^m$} from the generator $P_G(x)$
Maximize
$\widetilde{V}=\frac{1}{m}\sum\limits_{i=1}^{m}logD(x^i) + \frac{1}{m}\sum\limits_{i=1}^{m}log(1-D(\widetilde{x}^i))$
Binary Classifier
The output is $D(x)$; train by minimizing cross-entropy:
- If $x$ is a positive example $\Rightarrow$ minimize $-logD(x)$
- If $x$ is a negative example $\Rightarrow$ minimize $-log(1-D(x))$
D is a binary classifier (can be deep) with parameters $\theta_d$:
- {$x^1, x^2, …, x^m$} from $P_{data}(x)$ $\Rightarrow$ positive examples
- {$\widetilde{x}^1, \widetilde{x}^2, …, \widetilde{x}^m$} from $P_{G}(x)$ $\Rightarrow$ negative examples
Minimizing this cross-entropy is the same as maximizing $\widetilde{V}$
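The equivalence between minimizing the classifier's cross-entropy and maximizing $\widetilde{V}$ is just a sign flip, which a tiny sketch makes concrete. The discriminator outputs below are assumed placeholder values, not from the notes.

```python
import math

# Maximizing V^ over theta_d is the same as minimizing the binary
# cross-entropy of a classifier labeling real x as 1 and fake x~ as 0.
D_real = [0.8, 0.7, 0.9]   # assumed D(x^i) on real samples
D_fake = [0.3, 0.4, 0.1]   # assumed D(x~^i) on generated samples
m = len(D_real)

V_hat = (sum(math.log(d) for d in D_real) / m
         + sum(math.log(1 - d) for d in D_fake) / m)

cross_entropy = (sum(-math.log(d) for d in D_real) / m
                 + sum(-math.log(1 - d) for d in D_fake) / m)

print(V_hat, -cross_entropy)  # identical up to rounding
```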
In each training iteration:
Learning D (repeat k times):
- Sample m examples {$x^1, x^2, …, x^m$} from the data distribution $P_{data}(x)$
- Sample m noise samples {$z^1, z^2, …, z^m$} from the prior $P_{prior}(z)$
- Obtain generated data {$\widetilde{x}^1,\widetilde{x}^2,…,\widetilde{x}^m$}, $\widetilde{x}^i=G(z^i)$
- Update discriminator parameters $\theta_d$ to maximize
- $\widetilde{V}=\frac{1}{m}\sum\limits_{i=1}^{m}logD(x^i) + \frac{1}{m}\sum\limits_{i=1}^{m}log(1-D(\widetilde{x}^i))$
- $\theta_d \leftarrow \theta_d + \eta \nabla \widetilde{V}(\theta_d)$ (gradient ascent)
Learning G (repeat only once):
- Sample another m noise samples {$z^1, z^2, …,z^m$} from the prior $P_{prior}(z)$
- Update generator parameters $\theta_g$ to minimize
- $\widetilde{V}=\frac{1}{m}\sum\limits_{i=1}^{m}logD(x^i) + \frac{1}{m}\sum\limits_{i=1}^{m}log(1-\underline{D(G(z^i))})$ (only the second term depends on $\theta_g$)
- $\theta_{g}\leftarrow\theta_g-\eta \nabla\widetilde{V}(\theta_g)$ (gradient descent)
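One iteration of this alternating procedure can be sketched in pure Python. The setup is an assumed toy one (not from the notes): 1-D data $P_{data}=N(2,1)$, generator $G(z)=z+\theta_g$ with $z\sim N(0,1)$, logistic discriminator $D(x)=\sigma(wx+c)$, and hand-derived gradients. The checks at the end confirm that the D steps increase $\widetilde{V}$ and the G step decreases the generator's loss.

```python
import math
import random

random.seed(0)
sigmoid = lambda s: 1 / (1 + math.exp(-s))

# Assumed toy setup: P_data = N(2,1); G(z) = z + theta_g; D(x) = sigmoid(w*x + c)
m, eta, k = 64, 0.05, 3
w, c, theta_g = 0.0, 0.0, 0.0

x = [random.gauss(2.0, 1.0) for _ in range(m)]   # real samples
z = [random.gauss(0.0, 1.0) for _ in range(m)]   # noise samples
x_fake = [zi + theta_g for zi in z]              # generated samples

def V_hat(w, c):
    return (sum(math.log(sigmoid(w * xi + c)) for xi in x) / m
            + sum(math.log(1 - sigmoid(w * xi + c)) for xi in x_fake) / m)

V_before = V_hat(w, c)

# Learning D: k steps of gradient ascent on V_hat (gradients by hand)
for _ in range(k):
    gw = (sum((1 - sigmoid(w * xi + c)) * xi for xi in x) / m
          - sum(sigmoid(w * xi + c) * xi for xi in x_fake) / m)
    gc = (sum(1 - sigmoid(w * xi + c) for xi in x) / m
          - sum(sigmoid(w * xi + c) for xi in x_fake) / m)
    w, c = w + eta * gw, c + eta * gc

V_after = V_hat(w, c)

# Learning G: one gradient-descent step on (1/m) * sum_i log(1 - D(G(z^i)))
z2 = [random.gauss(0.0, 1.0) for _ in range(m)]  # another m noise samples

def g_loss(theta):
    return sum(math.log(1 - sigmoid(w * (zi + theta) + c)) for zi in z2) / m

g_before = g_loss(theta_g)
# d/dtheta log(1 - sigmoid(w*(z+theta)+c)) = -sigmoid(.) * w
grad = -sum(sigmoid(w * (zi + theta_g) + c) for zi in z2) * w / m
theta_g = theta_g - eta * grad
g_after = g_loss(theta_g)

print(V_after > V_before, g_after < g_before)  # True True
```

In a real implementation D and G are neural networks and the gradients come from backpropagation; the alternating structure is the same.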
Real Implementation
$V=E_{x\sim P_{data}}[logD(x)]+\underline{E_{x\sim P_G}[log(1-D(x))]}$ (the underlined term gives very small gradients at the beginning of training)
Instead use $V=E_{x\sim P_G}[-log(D(x))]$
Real implementation: label the $x$ sampled from $P_G$ as positive
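A quick numeric illustration of why $log(1-D(x))$ is slow at the beginning: early in training the discriminator confidently rejects generated samples, so $D(x)\approx 0$ there. Comparing the derivative magnitudes of the two objectives at an assumed value $D=0.01$:

```python
# d/dD [log(1 - D)] = -1/(1-D), nearly flat when D is small;
# d/dD [-log(D)]    = -1/D, very steep when D is small.
D = 0.01  # assumed discriminator output on a generated sample, early training
slope_original = abs(-1 / (1 - D))   # ~1.01: almost no gradient signal
slope_surrogate = abs(-1 / D)        # ~100: strong gradient signal
print(slope_original, slope_surrogate)
```

This is why the $-log(D(x))$ objective trains the generator much faster in the early phase, even though both objectives have the same fixed point.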