Generative Adversarial Networks, and Applications
Ali Mirzaei, Nimish Srivastava, Kwonjoon Lee, Songting Xu
CSE 252C, 4/12/17

Outline:
• Generative Models vs. Discriminative Models (Background)
• Generative Adversarial Networks (GANs)
• Mathematical Proofs
• GAN Results
• GAN Failures
• Deep Convolutional GAN Architecture (Related Work)
• GAN-based Applications (Related Work)

Generative Models vs. Discriminative Models:
• Data: Pixels
• Features: X
• Labels: Y (Zebra, No zebra)
• Discriminative model: models p(y|x) directly.
• Generative model: models p(x,y), and learns p(y|x) indirectly using Bayes' rule: p(y|x) ∝ p(x|y)p(y).

Discriminative Model:
• Maps each sample to feature space.
• Learns the decision boundary.
• For each point x' in feature space:
  ▫ P(Zebra|x') + P(No Zebra|x') = 1

Generative Model:
Learns the probability distribution of each class of the data:
• ∫ P(x|Zebra) dx = 1
• ∫ P(x|No Zebra) dx = 1

Generative Adversarial Networks (GANs):
There is a game between two networks:
• Generative network.
• Discriminative network.
Example: counterfeit money.
• G is a crook trying to generate fake money.
• D is a teller at the bank.
G's goal is to generate samples that D classifies as real. To succeed, G must learn the underlying distribution of real money.

Generative Network vs. Discriminator Network:
• Generative network G. Input: random noise z. Output: a fake image x' = G(z).
• Discriminator network D. Input: a real image x, or a fake image x' = G(z). Output: the probability that its input is real.
(Diagram: x → D → P(input is real); noise z → G → x' → D.)
• D's goals: D(x) = 1 and D(G(z)) = 0.
• G's goal: D(G(z)) = 1.

Objective Function:
Remember that x is a real image and G(z) is a fake image.
D's goals (D(x) = 1, D(G(z)) = 0) and G's goal (D(G(z)) = 1) are captured by the minimax value function:
$\min_G \max_D V(D,G) = E_{x \sim p_{data}(x)}[\log D(x)] + E_{z \sim p_z(z)}[\log(1 - D(G(z)))]$

Desired Convergence:
We will prove that the models approach an equilibrium where:
• D(x) = 0.5
• D(G(z)) = 0.5
At equilibrium, D cannot discriminate real from fake, and G has learned the underlying distribution of the real data ($p_{data}$).

Training Algorithm & Mathematical Proofs

Training Algorithm:
(Algorithm 1 from: Generative Adversarial Nets, Goodfellow et al., 2014.)
• The algorithm takes k steps to optimize the discriminative net D, and then one step to optimize the generative net G based on the outcome from D after those k steps.
• While optimizing D, we work on the entire value function. For G, we use only the second term of the value function, since the gradient of the first term with respect to G's parameters is zero.
• Note that we ascend the gradient when optimizing D, as it is a maximization problem, and descend the gradient when optimizing G, as it is a minimization problem.
• Initially, when G is far from optimal, the gradient of $\log(1 - D(G(z)))$ may be very small because D rejects fakes with high confidence. Instead, we can ascend the gradient of $\log(D(G(z)))$ early in training. A minimal sketch of this loop is given below.
(Figure from: Generative Adversarial Nets, Goodfellow et al., 2014.)
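To make the loop concrete, here is a minimal sketch assuming PyTorch (the slides do not name a framework). The toy task, network sizes, learning rates, batch size, and k are illustrative choices, not values from the paper. It follows the steps above: k ascent steps on V for D, then one step for G using the non-saturating $\log(D(G(z)))$ heuristic.

```python
import torch
import torch.nn as nn

# Toy task (illustrative): G must turn uniform noise z into samples from a
# Gaussian p_data with mean 4 and std 1.25. Tiny MLPs stand in for G and D.
G = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
k, batch, eps = 1, 128, 1e-8  # k = discriminator steps per generator step

for step in range(5000):
    # -- k steps on D: ascend V(G, D) = E[log D(x)] + E[log(1 - D(G(z)))] --
    for _ in range(k):
        x = 4.0 + 1.25 * torch.randn(batch, 1)   # real samples ~ p_data
        z = torch.rand(batch, 1)                 # noise ~ p_z
        fake = G(z).detach()                     # detach: do not update G here
        loss_D = -(torch.log(D(x) + eps) + torch.log(1 - D(fake) + eps)).mean()
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # -- one step on G: ascend log D(G(z)) (non-saturating heuristic) --
    z = torch.rand(batch, 1)
    loss_G = -torch.log(D(G(z)) + eps).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```

Minimizing loss_D is equivalent to ascending the value function for D, and detaching G(z) in the D step keeps the discriminator update from flowing into G. Swapping the generator objective from descending $\log(1 - D(G(z)))$ to ascending $\log D(G(z))$ leaves the fixed points unchanged but gives stronger gradients when D confidently rejects early fakes.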
Global Optimality of p_g = p_data:
This is shown using:
• A proposition giving the optimal discriminator D for a fixed G (Proposition 1).
• A theorem giving the global minimum for G (Theorem 1).

• Proposition 1: For G fixed, the optimal discriminator D is
$D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$

Proof for Proposition 1:
• Expected value: $E[g(x)] = \sum_x p(x)g(x) = \int_x p(x)g(x)\,dx$
• Change of variable: if Y = u(X) and u is invertible, then X = v(Y), where v is the inverse of u. Applying the change-of-variable rule to the density functions, the expectation over the noise z can be rewritten as an expectation over x = G(z) with density $p_g(x)$:
$V(G,D) = E_{x \sim p_{data}(x)}[\log D(x)] + E_{z \sim p_z(z)}[\log(1 - D(G(z)))]$
$= \int_x p_{data}(x)\log D(x)\,dx + \int_z p_z(z)\log(1 - D(G(z)))\,dz$
$= \int_x \big(p_{data}(x)\log D(x) + p_g(x)\log(1 - D(x))\big)\,dx$

Proof for Proposition 1 (Contd.):
• From the last equation above:
$V(G,D) = \int_x \big(p_{data}(x)\log D(x) + p_g(x)\log(1 - D(x))\big)\,dx$
• For a function $f\colon y \mapsto a\log(y) + b\log(1-y)$ with $a, b > 0$, the maximum occurs at $y = \frac{a}{a+b}$.
• Therefore, maximizing the integrand pointwise in x:
$D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$

Global Optimality of p_g = p_data:
• For a given G, the minimax value function can be defined as:
$C(G) = \max_D V(G,D) = E_{x \sim p_{data}}[\log D^*_G(x)] + E_{x \sim p_g}[\log(1 - D^*_G(x))]$
$= E_{x \sim p_{data}}\!\left[\log \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right] + E_{x \sim p_g}\!\left[\log \frac{p_g(x)}{p_{data}(x) + p_g(x)}\right]$
• Theorem 1: The global minimum of the virtual training criterion for G, i.e. C(G), is achieved if and only if p_g = p_data. At that point, C(G) attains the value -log(4).

Proof of Theorem 1:
• For p_g = p_data, $D^*_G = 1/2$; so
$C(G) = E_{x \sim p_{data}}[\log \tfrac{1}{2}] + E_{x \sim p_g}[\log \tfrac{1}{2}] = -\log(4)$
• In general, adding and subtracting $\log 2$ inside each expectation:
$C(G) = E_{x \sim p_{data}}\!\left[\log 2 - \log 2 + \log \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right] + E_{x \sim p_g}\!\left[\log 2 - \log 2 + \log \frac{p_g(x)}{p_{data}(x) + p_g(x)}\right]$
$= -\log(4) + \int_x p_{data}(x)\log\!\left(\frac{p_{data}(x)}{(p_{data}(x) + p_g(x))/2}\right)dx + \int_x p_g(x)\log\!\left(\frac{p_g(x)}{(p_{data}(x) + p_g(x))/2}\right)dx$

Proof of Theorem 1 (Contd.):
• The KL divergence between two probability distributions p and q is
$KL(p\,\|\,q) = \sum_i p_i \log \frac{p_i}{q_i} = \int_x p(x)\log \frac{p(x)}{q(x)}\,dx$
• From the last equation above:
$C(G) = -\log(4) + KL\!\left(p_{data} \,\Big\|\, \frac{p_{data} + p_g}{2}\right) + KL\!\left(p_g \,\Big\|\, \frac{p_{data} + p_g}{2}\right)$
• $KL(p\,\|\,q) \geq 0$ for all p, q; equality holds when p = q.
• Therefore the minimum of C(G) = -log(4) is obtained when p_g = p_data. A small numerical check follows below.
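As a sanity check of Proposition 1 and Theorem 1, the short sketch below (assuming NumPy; the two discrete distributions are arbitrary examples, not from the slides) plugs the optimal discriminator $D^*_G$ into V to compute C(G) for discrete distributions. The result should equal -log(4) exactly when p_g = p_data and be strictly larger otherwise.

```python
import numpy as np

def C(p_data, p_g):
    """C(G) = V(G, D*) for discrete distributions, using Proposition 1's D*."""
    d_star = p_data / (p_data + p_g)  # optimal discriminator, elementwise
    return np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

p_data = np.array([0.2, 0.5, 0.3])
p_g_bad = np.array([0.5, 0.25, 0.25])

print(C(p_data, p_data))   # -1.3863... : the global minimum, reached at p_g = p_data
print(C(p_data, p_g_bad))  # -1.2751... : strictly greater than -log(4)
print(-np.log(4.0))        # -1.3863... : reference value -log(4)
```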