Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space

Nguyen A
Clune J
Bengio Y
Dosovitskiy A
Yosinski J

Generating high-resolution, photo-realistic images has been a long-standing goal in machine learning. Recently, Nguyen et al. (2016) showed one interesting way to synthesize novel images by performing gradient ascent in the latent space of a generator network to maximize the activations of one or multiple neurons in a separate classifier network. In this paper we extend this method by introducing an additional prior on the latent code, improving both sample quality and sample diversity, leading to a state-of-the-art generative model that produces high quality images at higher resolutions (227x227) than previous generative models, and does so for all 1000 ImageNet categories. In addition, we provide a unified probabilistic interpretation of related activation maximization methods and call the general class of models "Plug and Play Generative Networks". PPGNs are composed of 1) a generator network G that is capable of drawing a wide range of image types and 2) a replaceable "condition" network C that tells the generator what to draw. We demonstrate the generation of images conditioned on a class (when C is an ImageNet or MIT Places classification network) and also conditioned on a caption (when C is an image captioning network). Our method also improves the state of the art of Multifaceted Feature Visualization, which generates the set of synthetic inputs that activate a neuron in order to better understand how deep neural networks operate. Finally, we show that our model performs reasonably well at the task of image inpainting. While image models are used in this paper, the approach is modality-agnostic and can be applied to many types of data.

Full Citation: Nguyen A, Yosinski J, Bengio Y, Dosovitskiy A, Clune J (2016). Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space. In Computer Vision and Pattern Recognition (CVPR '17), 2017. (pdf)

Code repository for the main experiments in this paper are available here. Also, the code for the predecessor (DGN-AM) of this work is here.

Figure 1: Images synthetically generated by Plug and Play Generative Networks at high-resolution (227x227) for four ImageNet classes. Not only are many images nearly photo-realistic, but samples within a class are diverse. An example t-SNE visualization of 400 samples generated for volcano class is here.

Video 1: Videos of sampling chains (with one sample per frame; no samples filtered out) from PPGN model between 10 different classes and within single classes (Triumph Arch and Junco).

Figure 2: For three class neurons (cardoon, harvester, broccoli) in a pre-trained ImageNet classifier, we show: a) the 9 real training set images that most highly activate that neuron; b) images synthesized by DGN-AM by Nguyen et al. (2016), which are of similar type and diversity to the real top-9 images; c) random real training set images in the class; and d) images synthesized by PPGN, which better represent the diversity of random images from the class.

Figure 3: Images synthesized to match a user text description. A PPGN containing the image captioning model from LRCN model by Donahue et al. (2015) can generate reasonable images that differ based on user-provided captions (e.g. red car vs. blue car, oranges vs. a pile of oranges). For each caption, we show 3 images synthesized starting from random initializations.

Figure 4: We perform class-conditional image sampling to fill in 100x100 missing pixels in a real 227x227 image. While PPGN only conditions on a specific class (top to bottom: volcano, junco, and bell pepper), PPGN-context also constrains the code to produce an image that matches the context region. PPGN-context (c) matches the pixels surrounding the masked region better than PPGN (b), and semantically fills in better than the Context-Aware Fill feature in Photoshop (d) in many cases. The result shows that the class-conditional PPGN does understand the semantics of images. More PPGN-context results are in the paper.

Figure 5: Images synthesized to activate a hidden neuron (number 196) previously identified as a "face detector neuron" by DeepVis toolbox by Yosinski et al. (2015) in the fifth convolutional (conv5) layer of the AlexNet DNN trained on ImageNet. The PPGN uncovers a large diversity of types of inputs that activate this neuron, thus performing Multifaceted Feature Visualization by Nguyen et al. (2016), which sheds light into what the neuron has learned to detect. The different facets include different types of human faces (top row), dog faces (bottom row), and objects that only barely resemble faces (e.g. the windows of a house, or something resembling a green wig). More examples are shown in this t-SNE visualization of 400 samples produced by PPGN in a sampling chain.

Pub. Info: 
In Computer Vision and Pattern Recognition (CVPR '17), IEEE, 2017. Spotlight oral presentation (~10% acceptance rate)
incollection{nguyen2016ppgn title={Plug \& Play Generative Networks: Conditional Iterative Generation of Images in Latent Space}, author={Nguyen, Anh and Yosinski, Jason and Bengio, Yoshua and Dosovitskiy, Alexey and Clune, Jeff}, booktitle={Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on}, year={2017}, organization={IEEE}, url = {} }