Wednesday, April 19, 2017

Thursday, April 20: Image-to-Image Translation

Image-to-Image Translation with Conditional Adversarial Networks, by Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros

18 comments:

  1. (Chris Mattioli) "Image-to-Image Translation with Conditional Adversarial Networks" proposes a general solution to the problem of image-to-image translation: predicting pixels from pixels. CNNs can be effective here, but the choice of loss function strongly affects the results and often requires hand-tuning for the desired application. The generative adversarial network (GAN) gets around this problem, and the paper proposes these models as the solution to image-to-image translation. Because the models are conditioned on the input image, they are called cGANs. The cGAN loss is a trade-off between the generated image and the score of the discriminator, i.e., the model that assesses how well the generator has done. This is combined with a more traditional loss to create the final objective (eq. 4). The overall architecture is built from convolution-BatchNorm-ReLU modules, with a few adjustments. The generator uses a "U-Net" architecture, which is similar to a standard autoencoder but adds skip connections from each layer in the encoder to its mirrored layer in the decoder. Their results show this is an effective strategy: it generates more realistic images. The discriminator is tweaked to penalize structure only at the scale of patches, since the simpler norms (L1 or L2) already do a good job of capturing the low frequencies. Their results show the combination of the two is quite effective: L1 alone leads to blurry results, while the cGAN alone leads to sharp results. Beyond this, training is classic SGD with batch normalization, using the statistics of the test batch at inference.
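
    For reference, the combined objective mentioned above (eq. 4 in the paper) can be written out from the paper's definitions as:

    ```latex
    \mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]
    \mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[\lVert y - G(x, z) \rVert_1]
    G^* = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda \, \mathcal{L}_{L1}(G)
    ```

    The discriminator only ever sees the cGAN term; the L1 term acts on the generator alone, pulling its output toward the ground truth.
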
    Question: Why is it important to use the "U-Net" approach for the generator? They claim they want to circumvent the bottleneck, but I had a hard time understanding why.

  2. Xinmeng’s Summary

    The paper presents a conditional adversarial network as a solution to image-to-image translation. The framework is general purpose: every task uses the same objective and architecture, and only the training data differs. The main advantage is that it is not application specific. They use a "U-Net"-based architecture for the generator and a convolutional "PatchGAN" classifier for the discriminator. The method trains a conditional GAN to predict an output image from the original image: the discriminator learns to classify between real and synthesized pairs, while the generator, observing an input image, learns to fool the discriminator. Both generator and discriminator use modules of the form convolution-BatchNorm-ReLU. Several optimizations are made: the paper uses L1 distance rather than L2 distance to encourage less blurring, since the Euclidean distance tends to average over plausible outputs. The authors also mention using a batch size of 1, which turns batch normalization into instance normalization. In the colorization results, the method is compared against approaches that were specifically engineered for that task.
    Discussion: Could you explain why a batch size of 1 is used? I still have trouble understanding the term "instance normalization." What is the difference between doing and not doing batch normalization in this way?
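
    One way to see it: batch normalization normalizes each channel using statistics computed over the whole batch, so with a batch of a single image those statistics come from that one image only, which is exactly what instance normalization does. A minimal PyTorch check of this equivalence (illustrative only, not the authors' code):

    ```python
    import torch
    import torch.nn as nn

    x = torch.randn(1, 64, 128, 128)  # a "batch" containing a single image (N=1)

    # In training mode, BatchNorm normalizes with the current batch's statistics;
    # with N=1 those statistics come from this one image, so the result matches
    # InstanceNorm (ignoring affine parameters and running-average bookkeeping).
    bn = nn.BatchNorm2d(64, affine=False, track_running_stats=False)
    inorm = nn.InstanceNorm2d(64, affine=False)

    print(torch.allclose(bn(x), inorm(x), atol=1e-5))  # True
    ```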

  3. This paper presents a framework that allows us to train Generative Adversarial Networks to produce images that depend on an observed image prior. The key insight is to train a discriminator that consumes both an image and the conditioning prior/ground truth. This way, we are able to semantically guide the trained cGAN to produce images that "flesh out" the prior that we pass it.

    Discussion:
    The failure mode that the authors describe for colorization seems bizarre. Why would a cGAN trained to colorize produce greyscale output and do nothing to an input image? Is there no change at all to the input image?

    Do we see a similar failure mode in other tasks (e.g., the cGAN does nothing to a line drawing of a bag)?

  4. This paper presents the use of conditional generative adversarial networks as a general-purpose solution to the task of image-to-image translation. A generative adversarial network is trained not only to produce a mapping from input image to output with the generator, but also to learn a task-specific loss function through a discriminator network. A U-Net architecture was adopted for the generator, with skip connections to carry low-level features across the network. An L1 loss, which captures low-frequency structure, was combined with a PatchGAN discriminator loss (penalizing structure at the scale of patches rather than the full image) to capture high-frequency structure. This strategy was applied to several image-to-image translation tasks, including semantic labels to photos, maps to aerial photos, black-and-white to color images, and several others. The output of the cGAN was compared with respective baselines through AMT perception studies as well as the FCN-score. While this method did not achieve state of the art on most of the tasks, it was found to be an effective multipurpose method.
    Discussion:
    How do the batch normalization summary statistics work when the batch size is 1 at test time, as described in 2.3?

    -Jonathan Hohrath

  5. This paper introduces a framework for image-to-image translation problems. They use conditional generative adversarial networks and learn a structured loss function, making their method fairly generalizable. Prior to this paper, we've seen GANs used to map noise to images; conditional GANs instead map images plus noise to images. This lets the authors use them for tasks like colorization and scene reconstruction. By learning a loss function, the technique is able to perform reasonably well in a wide variety of tasks.

    In all of the examples they give, the outputs are nearly identical to the inputs in position and orientation, which makes me wonder to what degree the network is learning the semantics of the objects in question. For a task like edges to images, if we distorted the generator's input edges, would it force the network to learn more of the object semantics instead of just texture?

    -Jay DeStories

  6. This paper proposes a general-purpose solution to image-to-image translation problems that not only learns the mapping from input image to output image, but also learns a loss function to train this mapping. The authors explore GANs in a conditional setting (cGANs), which learn a conditional generative model and are well suited to image-to-image translation tasks. When training cGANs, the discriminator learns to classify between real and synthesized pairs, while the generator learns both to fool the discriminator and to stay near the ground-truth output (the final objective uses an L1 term for this). Network architectures are built from convolution-BatchNorm-ReLU modules for both discriminator and generator, optimized with minibatch SGD and the Adam solver. In the experiments, they test the method on a variety of tasks and datasets with several different architectures and show that this simple framework is sufficient to achieve good results. A schematic training step is sketched below.
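
    The adversarial training described above, written out as one schematic step (a sketch under assumed interfaces G(x) -> image and D(x, y) -> patch logits, not the authors' code; lam=100 is the weighting reported in the paper):

    ```python
    import torch
    import torch.nn.functional as F

    def cgan_step(G, D, opt_G, opt_D, x, y, lam=100.0):
        # --- discriminator: real pairs -> 1, synthesized pairs -> 0 ---
        fake = G(x)
        d_real = D(x, y)
        d_fake = D(x, fake.detach())
        loss_D = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
                 F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

        # --- generator: fool the discriminator and stay close to ground truth (L1) ---
        d_fake = D(x, fake)
        loss_G = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake)) + \
                 lam * F.l1_loss(fake, y)
        opt_G.zero_grad(); loss_G.backward(); opt_G.step()
        return loss_D.item(), loss_G.item()
    ```
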
    Q: Why does L1+cGAN have a common failure mode that produces grayscale or desaturated results?

  7. This paper is really amazing. The authors use a GAN to learn image-to-image mappings. For example, they may learn to generate images of handbags. They use a conditional GAN that allows them to input a sketch of a handbag, and the GAN will produce an image of a handbag that most closely resembles the input. In order to force the GAN to produce realistic images, they modify the basic GAN architecture. They use an L1 loss to avoid the blurriness of L2 distances. In addition, they use a patch discriminator: instead of judging the entire picture at once, they convolve a judgment across the picture (see the sketch below). This allows the discriminator to focus on high-frequency structure while the L1 term handles the low frequencies. They show how they experimentally find the optimal size of these patches. Their results are qualitatively fascinating. They map between maps and aerial photos, and vice versa. They automatically colorize images. They convert sketches and plans to real images. It's easy to see that there would be some real commercial applications. However, there are also some failure cases that may give pause.
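
    The "convolved judgment" can be pictured as a small fully convolutional classifier whose output is a grid of real/fake scores, one per image patch, which are then averaged. A miniature sketch (hypothetical layer sizes, not the exact architecture from the paper):

    ```python
    import torch
    import torch.nn as nn

    class PatchDiscriminator(nn.Module):
        """A PatchGAN-style discriminator in miniature: strided convolutions ending
        in a 1-channel map, where each output location scores one patch of the
        (input, output) image pair as real or fake."""
        def __init__(self, in_ch=6):  # input and output images concatenated on channels
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
                nn.Conv2d(128, 1, 4, stride=1, padding=1),  # per-patch logits
            )

        def forward(self, x, y):
            return self.net(torch.cat([x, y], dim=1))

    scores = PatchDiscriminator()(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
    print(scores.shape)  # torch.Size([1, 1, 63, 63]): one logit per patch
    ```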

    Question:

    In training, do we need the "sketches" for each training example? It seems like you could learn a regular GAN on a really large training set, and then learn how to encode the sketches on a much smaller training subset.

    The paper also got me thinking about generating 3-d models. Suppose it generated architectural plans for a 3-d structure, like a house. Would there be a way to force the GAN to only produce plans that were physically possible? In other words, is there a way to force a logic system on the results produced by the GAN? Maybe we can give the generator a limited set of basic building blocks instead of all possible pixels.

  8. This paper takes the model from the Goodfellow 2014 generative adversarial network paper. The difference is that the generator sees the input image too. The generator creates an output image from this input, and the discriminator must distinguish whether an output comes from the real data (i.e., has a real relationship with the input) or from the generator. Using a GAN is definitely better than L1/L2 alone, because the latter tend to blur and grey out the colors (a toy example of why is sketched below). GAN + L1 regularization seems to work for some tasks. The results are OK; the model is better at producing images from sketches than the other way around. They do poorly on the AMT experiments, so there's room for improvement (for the field as a whole). Since everything they do here is based on Goodfellow 2014, I wonder if they could take the ideas from the paper we read last week; I remember that paper improved on Goodfellow 2014.
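
    The blurring and greying intuition in a toy example: when several outputs are equally plausible for the same input, the L2-optimal prediction is their mean (a washed-out average), while the L1-optimal prediction is the median, which stays closer to one of the plausible values. Neither resolves the ambiguity, which is why the adversarial term is still needed for sharpness. An illustrative NumPy sketch (values are made up):

    ```python
    import numpy as np

    # Three equally plausible intensities for the same output pixel:
    # dark (0.0) or two bright values (0.9, 1.0).
    plausible = np.array([0.0, 0.9, 1.0])

    print(plausible.mean())      # 0.633... : the L2-optimal prediction, a washed-out average
    print(np.median(plausible))  # 0.9      : the L1-optimal prediction, near a real mode
    ```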

  9. The researchers present a generic method for image-to-image translation problems using conditional adversarial networks. Conditional GANs learn a mapping from an input image x and a random noise vector z to an output y. The researchers add skip connections to give the generator a way to get around the information bottleneck. They also mention restricting the GAN discriminator to model only high-frequency structure. In evaluating the quality of the generated images, the researchers used two methods: a study on Amazon Mechanical Turk to determine whether an image is judged real or fake, and a metric measuring whether the generated images are realistic enough that an existing recognition system can recognize them.

    Question:
    I am also a bit confused as to how the skip connections work.
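
    One way to picture the skip connections: each decoder stage receives not only the upsampled features from the layer below it, but also the feature map from the mirrored encoder layer, concatenated along the channel dimension, so fine spatial detail can bypass the bottleneck. A minimal sketch (hypothetical layer sizes, not the paper's exact generator):

    ```python
    import torch
    import torch.nn as nn

    class TinyUNet(nn.Module):
        """A two-level U-Net showing how a skip connection is wired."""
        def __init__(self):
            super().__init__()
            self.enc1 = nn.Conv2d(3, 64, 4, stride=2, padding=1)     # 256 -> 128
            self.enc2 = nn.Conv2d(64, 128, 4, stride=2, padding=1)   # 128 -> 64 (bottleneck)
            self.dec2 = nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1)     # 64 -> 128
            # decoder input = upsampled features + skip from enc1, hence 64 + 64 channels
            self.dec1 = nn.ConvTranspose2d(64 + 64, 3, 4, stride=2, padding=1)  # 128 -> 256

        def forward(self, x):
            e1 = torch.relu(self.enc1(x))
            e2 = torch.relu(self.enc2(e1))
            d2 = torch.relu(self.dec2(e2))
            d1 = self.dec1(torch.cat([d2, e1], dim=1))  # the skip connection
            return torch.tanh(d1)

    out = TinyUNet()(torch.randn(1, 3, 256, 256))
    print(out.shape)  # torch.Size([1, 3, 256, 256])
    ```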

  10. Nathan Watts' summary:

    This paper presents an architecture that represents a general solution to many different image-to-image problems, which previously only had individual solutions. Their architecture is an extension of GANs called "conditional GANs." In a cGAN, in addition to the noise, the generator is conditioned on the input image, and the discriminator is trained to distinguish real from fake input/output pairs. Thus, a pair is labeled fake if the output doesn't look real or if it doesn't appear to correspond to the given input image. Both modules use convolution-BatchNorm-ReLU layers. Encoder-decoder and U-Net architectures were tested for the generator, the latter providing much better results. They tested the architecture on several different problems and found that it consistently performs well. Additionally, they found that it helped to also include an L1 loss: while it didn't dramatically improve subjective image quality, it improved accuracy, because it forces the network to be more faithful to the ground truth and to stop hallucinating details where none are needed. They also found that changing the patch size in the discriminator affects the results; 70 x 70 px turned out to work best for their main benchmark (a quick check of where that number comes from is sketched below). While this method works very well on problems where the level of detail is increased, on problems where the level of detail is decreased (semantic segmentation, for example), it works but underperforms the state of the art.
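
    The 70 x 70 figure is the receptive field of a single discriminator output unit. It can be sanity-checked from a stack of 4 x 4 convolutions with the strides commonly used for this PatchGAN configuration (three stride-2 layers followed by two stride-1 layers); treat this as a back-of-the-envelope check rather than the paper's exact bookkeeping:

    ```python
    # Receptive field of one discriminator output, computed backwards through
    # a stack of 4x4 convolutions: three with stride 2, then two with stride 1.
    layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]  # (kernel, stride), input -> output order

    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    print(rf)  # 70
    ```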

    Question: if this method gets good enough, could it potentially be used as a data augmentation technique for a segmentation network, for example? If not, what problems would it introduce?

  11. This paper presents the use of conditional generative adversarial networks for generating an image from an input image. The paper applies this framework to the tasks of colorizing black-and-white photos, turning daytime photos into nighttime ones, and generating realistic photos from a segmented image. On top of a traditional conditional GAN framework, this paper employs the U-Net architecture in the generator and a "PatchGAN" structure in the discriminator, which has the discriminator judge input images as a series of patches rather than just the entire image. To evaluate their networks, the authors had workers on Mechanical Turk judge whether images were real or generated after seeing them for only one second. They also used an image segmentation network (FCN-8s) trained on real photos to segment the generated photos.

    Discussion:
    What is the cause of the failure mode for the colorization task?

  12. Summary by Jason Krone:

    This paper uses GANs conditioned on a feature representation of an input image to generate output images for various translation tasks, including day to night and BW to color (colorization). The authors modify the generator network to use an encoder-decoder structure to allow conditioning on input images: the encoder produces a feature representation of the image, and the decoder takes this representation and produces a generated image. Essentially, the image features take the place of the random noise (z) typically used to seed GANs. To prevent the generator from losing all low-level information about the input image at later layers in the network, the authors add skip connections to the generator in the shape of a "U-Net". They also use a PatchGAN discriminator, which returns a confidence for a given patch of the image being real (toward 1) or fake (toward 0), and which achieves higher per-class accuracy than existing methods.

    Questions:
    - What are the potential downfalls of using FCN-score to evaluate the model?

  13. The authors present the use of an adversarial network for general-purpose image-to-image translation. The loss function is hard to get right and usually involves trial and error, so they train a cGAN (conditional GAN) to learn a loss function (through a discriminator) while also doing the image-to-image mapping. The generator uses a U-Net architecture (which is like an autoencoder but with skip connections). The objective combines an L1 term, which captures low-frequency structure, with a patch-based discriminator that penalizes structure at the scale of patches (high frequencies).
    They used their network on a few tasks: synthesizing photos from label maps, colorizing images, and reconstructing objects. The results showed that their performance was not usually the best, but it was generally good enough to use their network for multiple purposes.
    I would be interested to learn about other networks that learn a loss function.

  14. This paper describes the architecture and experimental results of the Conditional GAN (cGAN) model for image-to-image generation. The model seeks to serve as a framework for tasks that take images in and generate images, such as producing segmentation maps, generating images from segmentation maps, colorizing images, producing images from their extracted edges, etc. The model improves upon the previously seen GAN model in that the input to the generator is a prompt image, and the discriminator also receives the prompt image and must determine whether the pair of images is real or generated. The other most notable improvement is that the described architecture uses a U-Net to preserve low-level features when generating the images, as opposed to the typical model, which bottlenecks the pipeline in the middle before producing a full image. One big finding of the paper is the improvement in sharpness of the resulting image from a cGAN, as opposed to networks trained using loss functions such as L1 or L2. In general, the results of cGANs look quite realistic for most hallucination-style applications, whereas they do not perform as well as prior models on more objective tests such as segmentation.

    Discussion: I’m confused about how they generate in patches and then turn this into a full resulting image. This seems to be responsible for some of the artifacts that come out of the generated images.

    Ben Papp

  15. This paper presents the use of an image-conditional generative adversarial network for general image-to-image translation tasks. Conditional GANs take in an observed image and random noise z to create/translate to a different image. The generative network is adversarially trained by a discriminator that takes in the input/output pair and learns to recognize "fake" pairs.

    The generative network used in this paper is based on the U-Net. In image translation tasks such as colorization, adding skip connections is beneficial because low-level features like edges are shared between input and output.

    The discriminator model is altered to encourage high-frequency crispness, while the L1 loss component encourages low-frequency correctness. The specific alteration has the discriminator judge "realness" over N x N patches and average these judgments across the image.

    Evaluation was done with AMT and also the FCN-score, which is an interesting way of applying recognition networks to generated images and comparing the resulting accuracy with that on a real dataset. A sketch of the idea follows.
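
    The FCN-score in schematic form: run an off-the-shelf semantic segmentation network (trained on real Cityscapes photos) on the generated photos and score its predictions against the label maps the generator was conditioned on. The helper names below are hypothetical; this is a sketch of the evaluation, not the authors' script:

    ```python
    def fcn_score(generator, segmentation_net, dataset):
        """Per-pixel accuracy of a pretrained segmentation net on generated photos.
        If the photos are realistic, the net should recover the input label maps."""
        correct, total = 0, 0
        for labels, _real_photo in dataset:          # (label-map tensor, real photo) pairs
            fake_photo = generator(labels)           # photo synthesized from the label map
            predicted = segmentation_net(fake_photo).argmax(dim=1)
            correct += (predicted == labels).sum().item()
            total += labels.numel()
        return correct / total                       # the paper also reports class IoU
    ```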

    Question:
    - The FCN-score is limited to the model's ability to generate a specific type of image (the Cityscapes dataset); to what extent is that a good measure of the flexibility that is so highly advertised in the paper?

  16. In this paper, the authors use a GAN to find a mapping from input images to translated images. They justify using a GAN by noting that, despite the success of CNNs and their good performance, we still have to find a proper loss function by hand.

    In the GAN setting, the generative model tries to learn a mapping from random noise (together with the input image) to an image, while the discriminator compares the real and fake images fed to it.

    To test their idea, they use several datasets representing both semantic segmentation and photo generation. They learn mappings such as day-->night and sketch-->photo.

    Q: Could you please go over eq. 4? I mean conceptually. Thanks!

  17. The authors introduce a general framework for image-to-image translation using conditional GANs (cGANs).

    The problem of image-to-image translation is that of mapping an input image from one space to another while conserving its graphic structure (going from black and white to colour, or from sketch to real image). For this purpose, they make use of cGANs, which are basically GANs with the output of the generator and the prediction of the discriminator conditioned on an input image. This way, the generator can learn to map images in the initial domain to the final domain.

    Concretely, the generator is implemented using a U-Net, which is basically an encoder-decoder network with skip connections between paired layers. Those skips are added so that the final generated image retains low-level details. The discriminator is a network called PatchGAN that works on different patches of the original image; the final decision is obtained by averaging the responses over all patches. The authors aim to capture the high-frequency details of the image with this architecture. The loss used to train the model has two components: a cGAN loss and an L1 loss.

    This model has been shown to work well in different scenarios such as mapping from semantic labels to photos or colouring black and white images.

    Question: In Figure 6, they compare the performance using L1 and different sizes of patches. I do not understand why they do this comparison. Isn't the choice of the loss independent of the size of the patch?

    -- Jorge Sendino

  18. This paper presents a novel way of using conditional adversarial networks to solve image-to-image translation problems. The brilliance of this paper lies in the fact that its network learns not only a mapping from input image to output image, but also a use-specific loss function. The network used is a conditional GAN, which maps random noise and an observed image into an output image.
    The network combines the adversarial loss with an L1 loss (chosen over L2) in order to reduce blurriness in the output images.
    One concern with this genre of image generation is that quality evaluation is notoriously difficult and subjective.

    What is L2 loss best for? In a subsequent paper L2 loss is used, successfully, for generating rendered images. Does it fail with photographs?
    Why does this network support using a randomized input vector if it has a supervised ground truth output in the form of an image? The paper mentions that the model ignores the noise upon learning.
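
    On the noise point: the paper reports that the generator simply learned to ignore an explicit z, so in the final model noise enters only through dropout, which is kept active at test time as well. A minimal sketch of that detail (hypothetical module, assuming a PyTorch-style generator):

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DecoderBlock(nn.Module):
        """One decoder stage whose dropout stays active even at test time; this is
        where the generator keeps its (limited) stochasticity after ignoring z."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.up = nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1)

        def forward(self, x):
            # training=True forces dropout sampling regardless of .train()/.eval() mode
            return F.dropout(torch.relu(self.up(x)), p=0.5, training=True)

    block = DecoderBlock(128, 64).eval()
    x = torch.randn(1, 128, 32, 32)
    print(torch.equal(block(x), block(x)))  # False: two passes give different outputs
    ```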

    -Sam Woolf
