GAN Study Notes (1)

[Notes on 李宏毅 (Hung-yi Lee)'s 2017 Deep Learning GAN course] Part 1

GAN: Generative Adversarial Network

Review: Auto-encoder

Input Image -> NN Encoder -> code -> NN Decoder -> Output Image
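
A minimal PyTorch sketch of the pipeline above (the framework choice, layer sizes, and the 784-dim input are my own illustrative assumptions, not from the course):

```python
import torch
import torch.nn as nn

# Minimal auto-encoder: Input Image -> Encoder -> code -> Decoder -> Output Image.
class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, code_dim),                 # the "code" (latent vector)
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),  # pixel values in [0, 1]
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code)

# One reconstruction training step (MSE between input and output image).
model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(16, 784)                  # stand-in for a batch of flattened images
loss = nn.functional.mse_loss(model(x), x)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```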

Basic Idea of GAN

Maximum Likelihood Estimation

  • Given a data distribution $P_{data}(x)$
  • We have a distribution $P_G(x;\theta)$ parameterized by $\theta$
    • E.g. $P_G(x;\theta)$ is a Gaussian Mixture Model, and $\theta$ are the means and variances of the Gaussians
    • We want to find $\theta$ such that $P_G(x;\theta)$ is close to $P_{data}(x)$

Sample $\{x^1,x^2,…,x^m\}$ from $P_{data}(x)$; for each sample we can compute $P_G(x^i;\theta)$.

Likelihood of generating the samples:

$L=\prod\limits_{i=1}^mP_{G}(x^i;\theta)$

Find $\theta^*$ maximizing the likelihood:

$\theta^*=arg\mathop{max}\limits_{\theta}\prod\limits_{i=1}^mP_{G}(x^i;\theta)=arg\mathop{max}\limits_{\theta}\sum\limits_{i=1}^mlogP_{G}(x^i;\theta)$

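As a small numerical illustration of this objective (my own toy example, not from the course): take $P_G(x;\theta)$ to be a single Gaussian with $\theta=(\mu,\sigma)$ and maximize the summed log-likelihood of the samples.

```python
import numpy as np
from scipy.stats import norm

# Toy MLE: P_G(x; theta) is a single Gaussian, theta = (mu, sigma).
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=0.5, size=1000)   # x^1..x^m drawn from P_data

def log_likelihood(mu, sigma, x):
    # sum_i log P_G(x^i; theta)
    return np.sum(norm.logpdf(x, loc=mu, scale=sigma))

# For a single Gaussian the maximizer has a closed form: sample mean and std.
mu_star, sigma_star = samples.mean(), samples.std()
print(mu_star, sigma_star)                     # close to the true (2.0, 0.5)
print(log_likelihood(mu_star, sigma_star, samples))
print(log_likelihood(0.0, 1.0, samples))       # any other theta gives a lower value
```
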
Now, instead of a Gaussian mixture, we use a neural network to define $P_G(x;\theta)$:

  • Generator G

    • G is a function: input $z$, output $x$.
    • Given a prior distribution $P_{prior}(z)$, a probability distribution $P_G(x)$ is defined by the function G.
  • Discriminator D

    • D is a function: input $x$, output a scalar.
    • It evaluates the “difference” between $P_G(x)$ and $P_{data}(x)$.
  • There is a function $V(G, D)$.

    $V = E_{x\sim P_{data}}[logD(x)] + E_{x\sim P_G}[log(1-D(x))]$

    Given a generator G, $\mathop{max}\limits_{D}V(G,D)$ evaluates the “difference” between $P_G$ and $P_{data}$. We then pick the G whose $P_G$ is most similar to $P_{data}$.

  • Given G, what is the optimal $D^*$ maximizing

    $V = E_{x\sim P_{data}}[logD(x)] + E_{x\sim P_G}[log(1-D(x))]$

    ​ $= \int\limits_{x}P_{data}(x)logD(x)dx + \int\limits_{x}P_G(x)log(1-D(x))dx$

    ​ $=\int\limits_{x}[P_{data}(x)logD(x) + P_G(x)log(1-D(x))]dx$

    Assume that $D(x)$ can have any value here

  • Given $x$, the optimal $D^*$ maximizing

    $\underline{P_{data}(x)}log\underline{D(x)} + \underline{P_G(x)}log(1-\underline{D(x)})$

    We denote $P_{data}(x)=a, D(x)=D, P_G(x)=b$

  • Find $D^*$ maximizing: $f(D)=a\times log(D)+b\times log(1-D)$

    $\frac{df(D)}{dD}=a\times\frac{1}{D} + b\times\frac{1}{1-D}\times(-1)=0$

    $\Rightarrow a \times \frac{1}{D^*}=b\times\frac{1}{1-D^*}$

    $\Rightarrow D^*=\frac{a}{a+b}=\frac{P_{data}(x)}{P_{data}(x)+P_G(x)}$

    $0<D^*<1$

  • $\mathop{max}\limits_{D}V(G,D)=V(G, D^*)$

    $=E_{x\sim P_{data}}[log\frac{P_{data}(x)}{P_{data}(x)+P_G(x)}]+E_{x\sim P_G}[log\frac{P_G(x)}{P_{data}(x)+P_G(x)}]$

    $=\int\limits_{x}P_{data}(x)log\frac{P_{data}(x)}{P_{data}(x)+P_G(x)}dx + \int\limits_{x}P_G(x)log\frac{P_G(x)}{P_{data}(x)+P_G(x)}dx$

    $=-2log2+\int\limits_{x}P_{data}(x)log\frac{P_{data}(x)}{(P_{data}(x)+P_G(x))/2}dx + \int\limits_{x}P_G(x)log\frac{P_G(x)}{(P_{data}(x)+P_G(x))/2}dx$

    $=-2log2 + KL(P_{data}(x)||\frac{P_{data}(x)+P_G(x)}{2})+KL(P_G(x)||\frac{P_{data}(x)+P_G(x)}{2})$

    In the end …

  • Generator G, Discriminator D

  • Looking for $G^*$ such that

    $G^*=arg\mathop{min}\limits_{G}\mathop{max}\limits_{D}V(G,D)$

  • Given G, $\mathop{max}\limits_DV(G,D)=-2log2+2JSD(P_{data}(x)||P_G(x))$ (this identity is checked numerically in the sketch after this list)

  • What is the optimal G?

    $P_G(x)=P_{data}(x)$
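
The derivation above can be checked numerically. The sketch below (my own check, not from the course) uses two toy discrete distributions, computes $D^*(x)=\frac{P_{data}(x)}{P_{data}(x)+P_G(x)}$, and verifies that $V(G,D^*)=-2log2+2JSD(P_{data}||P_G)$:

```python
import numpy as np

# Two toy discrete distributions over 4 outcomes.
p_data = np.array([0.4, 0.3, 0.2, 0.1])
p_g    = np.array([0.1, 0.2, 0.3, 0.4])

# Optimal discriminator D*(x) = P_data(x) / (P_data(x) + P_G(x)).
d_star = p_data / (p_data + p_g)

# V(G, D*) = E_{x~P_data}[log D*(x)] + E_{x~P_G}[log(1 - D*(x))].
v_at_d_star = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

# JSD(P_data || P_G) with the mixture M = (P_data + P_G) / 2.
m = 0.5 * (p_data + p_g)
kl = lambda p, q: np.sum(p * np.log(p / q))
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

print(v_at_d_star)                     # max_D V(G, D)
print(-2 * np.log(2) + 2 * jsd)        # identical up to floating-point error
```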

Algorithm

  • To find the best G, we minimize the loss function $L(G)=\mathop{max}\limits_{D}V(G,D)$ by gradient descent on $\theta_G$ (D is held fixed when computing the gradient); a toy sketch of this alternating procedure follows the list below.

    $\theta_{G}\gets\theta_{G}-\eta\ \partial L(G)/ \partial \theta_{G}$ , where $\theta_{G}$ defines G

  • Given $G_0$

  • Find $D_0^* $ maximizing $V(G_0, D)$

​ $V(G_0,D_0^*)$ is the JS divergence between $P_{data}(x)$ and $P_{G_0}(x)$

  • $\theta_G \gets \theta_G - \eta \partial V(G, D_0^*)/\partial \theta_G\ \Rightarrow$ Obtain $G_1$

  • Find $D_1^*$ maximizing $V(G_1,D)$

$V(G_1,D_1^*)$ is the JS divergence between $P_{data}(x)$ and $P_{G_1}(x)$

  • $\theta_G \gets \theta_G - \eta \partial V(G, D_1^*)/\partial \theta_G\ \Rightarrow$ Obtain $G_2$
  • … …
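
A toy sketch of this alternating procedure (my own illustration, not from the course): the generator distribution is a one-parameter Bernoulli, $\mathop{max}\limits_DV(G,D)$ is evaluated in closed form as $-2log2+2JSD(P_{data}||P_G)$, and gradient descent on $\theta_G$ drives $P_G$ toward $P_{data}$.

```python
import numpy as np

# P_data and P_G are Bernoulli distributions; theta is P_G(x=1).
p_data = np.array([0.3, 0.7])

def p_g(theta):
    return np.array([1.0 - theta, theta])

def loss(theta):
    # L(G) = max_D V(G, D) = -2 log 2 + 2 JSD(P_data || P_G), in closed form.
    pg = p_g(theta)
    m = 0.5 * (p_data + pg)
    kl = lambda p, q: np.sum(p * np.log(p / q))
    return -2 * np.log(2) + kl(p_data, m) + kl(pg, m)

theta, eta, eps = 0.1, 0.1, 1e-5
for _ in range(200):
    grad = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)  # numerical gradient
    theta -= eta * grad                                         # theta_G update
print(theta)   # approaches 0.7, i.e. P_G(x) = P_data(x)
```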

In practice

  • Given G, how do we compute $\mathop{max}\limits_{D}V(G,D)$?

    • Sample $\{x^1, x^2, …, x^m\}$ from $P_{data}(x)$, and sample $\{\widetilde{x}^1, \widetilde{x}^2,…,\widetilde{x}^m\}$ from the generator's distribution $P_G(x)$

    • Maximize $\widetilde{V}=\frac{1}{m}\sum\limits_{i=1}^{m}logD(x^i) + \frac{1}{m}\sum\limits_{i=1}^{m}log(1-D(\widetilde{x}^i))$


    Binary Classifier

    The output is $D(x)$; training D minimizes the cross-entropy loss:

    If $x$ is a positive example $\Rightarrow$ Minimize $-logD(x)$

    If $x$ is a negative example $\Rightarrow$ Minimize $-log(1-D(x))$


    D is a binary classifier (can be deep) with parameters $\theta_d$

    $\{x^1, x^2, …, x^m\}$ from $P_{data}(x)$ $\Rightarrow$ Positive examples

    $\{\widetilde{x}^1, \widetilde{x}^2, …, \widetilde{x}^m\}$ from $P_{G}(x)$ $\Rightarrow$ Negative examples

    Minimizing this classifier's cross-entropy loss $L=-\widetilde{V}$ is equivalent to maximizing $\widetilde{V}$.

  • In each training iteration (a code sketch of one full iteration follows the two lists below):


Learning D, Repeat k times

  • Sample m examples {$x^1, x^2, …, x^m$} from data distribution $P_{data}(x)$
  • Sample m noise samples {$z^1, z^2, …, z^m$} from the prior $P_{prior}(z)$
  • Obtain generated data {$\widetilde{x}^1,\widetilde{x}^2,…,\widetilde{x}^m$}, $\widetilde{x}^i=G(z^i)$
  • Update discriminator parameters $\theta_d$ to maximize
    • $\widetilde{V}=\frac{1}{m}\sum\limits_{i=1}^{m}logD(x^i) + \frac{1}{m}\sum\limits_{i=1}^{m}log(1-D(\widetilde{x}^i))$
    • $\theta_d \leftarrow \theta_d + \eta \triangledown \widetilde{V}(\theta_d)$ (gradient ascent)

Learning G, repeat only once

  • Sample another m noise samples {$z^1, z^2, …,z^m$} from the prior $P_{prior}(z)$

  • Update generator parameters $\theta_g$ to minimize

    • $\widetilde{V}=\frac{1}{m}\sum\limits_{i=1}^{m}logD(x^i) + \frac{1}{m}\sum\limits_{i=1}^{m}log(1-\underline{D(G(z^i))})$

    • $\theta_{g}\leftarrow\theta_g-\eta \triangledown\widetilde{V}(\theta_g)$ (gradient descent)
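
A minimal PyTorch-style sketch of one full training iteration as described in the two lists above (the networks, dimensions, learning rates, and toy data distribution are my own illustrative assumptions). Maximizing $\widetilde{V}$ for D is implemented as minimizing binary cross-entropy with real samples labeled 1 and generated samples labeled 0:

```python
import torch
import torch.nn as nn

# Illustrative toy networks and data; sizes are stand-ins, not from the course.
z_dim, x_dim, m = 8, 2, 64
G = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
bce = nn.BCELoss()   # minimizing BCE for D = maximizing V~

def sample_data(m):
    # Stand-in for sampling from P_data(x); here a fixed Gaussian blob.
    return torch.randn(m, x_dim) + torch.tensor([2.0, 2.0])

for _ in range(1000):
    # ---- Learning D, repeat k times ----
    for _ in range(1):                       # k = 1 here
        x = sample_data(m)                   # x^i ~ P_data
        z = torch.randn(m, z_dim)            # z^i ~ P_prior
        x_tilde = G(z).detach()              # x~^i = G(z^i), no gradient into G
        # max V~  <=>  min BCE(D(x), 1) + BCE(D(x~), 0)
        loss_d = bce(D(x), torch.ones(m, 1)) + bce(D(x_tilde), torch.zeros(m, 1))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # ---- Learning G, only once ----
    z = torch.randn(m, z_dim)                # another m noise samples
    # minimize log(1 - D(G(z))); the first term of V~ does not depend on theta_g
    loss_g = torch.log(1.0 - D(G(z)) + 1e-8).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```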

Real Implementation

$V=E_{x\sim P_{data}}[logD(x)]+\underline{E_{x\sim P_G}[log(1-D(x))]}$: the underlined term gives the generator only a small gradient at the beginning of training, when $D(x)$ is close to 0 for generated samples.

$V=E_{x \sim P_G}[-log(D(x))]$

In the real implementation, $x$ sampled from $P_G$ is simply labeled as positive when updating G, which yields the $-log(D(x))$ objective above.
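
A minimal sketch of this swap (same illustrative toy setup as the training-loop sketch above; only the generator objective changes):

```python
import torch
import torch.nn as nn

# Same illustrative toy setup as before (assumed sizes, not from the course).
z_dim, x_dim, m = 8, 2, 64
G = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
bce = nn.BCELoss()

z = torch.randn(m, z_dim)

# Original generator objective, E[log(1 - D(G(z)))]: its gradient is small
# when D(G(z)) is close to 0, which is exactly the situation early in training.
# loss_g = torch.log(1.0 - D(G(z)) + 1e-8).mean()

# Real implementation, E[-log(D(G(z)))]: larger gradient early on. This is the
# same as labeling the generated samples as positive (label 1) in the BCE loss.
loss_g = bce(D(G(z)), torch.ones(m, 1))
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```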