Friday, April 21, 2017

Tues April 25: Learning to Generate Chairs

Learning to Generate Chairs, Tables and Cars with Convolutional Networks. Alexey Dosovitskiy, Jost Tobias Springenberg, Maxim Tatarchenko, Thomas Brox. CVPR 2015.

17 comments:

  1. Nathan Watts’ summary:
    This paper presents a method for applying convolutional neural networks to understanding 3d models and establishes some data handling practices for this new dataset domain. They show that the network is learning meaningful features by demonstrating knowledge transfer within classes and between classes, through latent vector arithmetic and interpolation, and by generating novel examples.
    The model consists of two parts: one which encodes the 3D model description, and a generator which decodes it into a 2D projection. The encoder is simply a fully connected network which first processes the input model/class, the view parameters, and the transform parameters independently, and then concatenates them for further processing. The generative part of the model uses “up-convolution,” which seems very similar to the technique used by convolutional GANs (though they don’t use a fully convolutional network and in fact use “unpooling” rather than strided up-convolution). The network also generates a segmentation map in parallel with the projection. An alternate model that uses a probabilistic generator was also tested and provided better interpolation results. Several different network architectures were compared, and the one described above was found to perform best.
    They found that increasing the dataset size reduced the network's ability to generate fine details, decreasing visual output quality, but dramatically decreased error; this is likely a result of the network being forced to create better abstractions of the feature set rather than memorizing fine details. Data augmentation has a similar effect on visual quality, but also increases the error. However, the augmented network was shown to perform better at interpolation, suggesting that the loss in sharpness was a trade-off for better generalization.
    They also visualized the network in many different ways to try to understand it, as it is a fairly novel application. They isolated the factors that control color, position, transformation, etc.

    Questions:
    Why is the 4-element “view” vector being up-sampled to a 512-element vector? How would this be at all useful?
    What exactly is the probabilistic generator? I didn’t fully understand this part.
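
    On the first question: a plausible reading is that each input (class, view, transform) is expanded to the same width before concatenation, so that the tiny 4-element view vector is not swamped by the much larger class code. Below is a minimal PyTorch-style sketch of the two-part layout described above; all layer widths, input dimensions, and the plain upsampling standing in for the paper's unpooling are illustrative assumptions, not the authors' exact architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ChairGenerator(nn.Module):
        def __init__(self, n_styles=800, view_dim=4, transf_dim=12):
            super().__init__()
            # Each input is first expanded independently to a common width, so the
            # 4-d view vector gets as much representational room as the style code.
            self.fc_style = nn.Linear(n_styles, 512)
            self.fc_view = nn.Linear(view_dim, 512)
            self.fc_transf = nn.Linear(transf_dim, 512)
            self.fc_shared = nn.Sequential(
                nn.Linear(3 * 512, 1024), nn.LeakyReLU(),
                nn.Linear(1024, 8 * 8 * 256), nn.LeakyReLU(),
            )
            # Two decoding streams: one for the RGB image, one for the segmentation mask.
            def stream(out_ch):
                return nn.Sequential(
                    nn.Upsample(scale_factor=2), nn.Conv2d(256, 128, 5, padding=2), nn.LeakyReLU(),
                    nn.Upsample(scale_factor=2), nn.Conv2d(128, 64, 5, padding=2), nn.LeakyReLU(),
                    nn.Upsample(scale_factor=2), nn.Conv2d(64, out_ch, 5, padding=2),
                )
            self.rgb_stream = stream(3)
            self.segm_stream = stream(1)

        def forward(self, c, v, theta):
            h = torch.cat([F.leaky_relu(self.fc_style(c)),
                           F.leaky_relu(self.fc_view(v)),
                           F.leaky_relu(self.fc_transf(theta))], dim=1)
            h = self.fc_shared(h).view(-1, 256, 8, 8)
            return self.rgb_stream(h), torch.sigmoid(self.segm_stream(h))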

  2. Xinmeng’s Summary
    The paper presents a generative neural network which can generate images of objects from a given object style, viewpoint and color. The method can assess the similarity of different models and interpolate between given viewpoint images. It learns the distribution from which the images are to be generated and a generator which produces an image conditioned on a vector from this distribution. The paper trains a neural network capable of generating 2D projections of the models, accepting only high-level values as input and producing RGB images. The model is trained with standard backpropagation to minimize the Euclidean reconstruction error of the generated image. The high-level latent representation and supervised training allow it to generate relatively large, high-quality images and to completely control which images to generate rather than relying on random sampling.

    Discussion: What is the difference between the mapping from the representation to a high-dimensional image, described here as unpooling + convolution ("up-convolution"), and the "deconvolution" of previous work?
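
    A minimal PyTorch-style sketch of the two options being contrasted (all sizes illustrative; plain nearest-neighbor upsampling stands in for the paper's unpooling):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    x = torch.randn(1, 64, 8, 8)  # a low-resolution feature map

    # (a) "up-convolution": enlarge the feature map first, then run an ordinary
    #     convolution over the enlarged map.
    upsampled = F.interpolate(x, scale_factor=2)                       # (1, 64, 16, 16)
    y_upconv = nn.Conv2d(64, 32, kernel_size=5, padding=2)(upsampled)  # (1, 32, 16, 16)

    # (b) "deconvolution" (transposed convolution): a single learned layer whose
    #     stride does the enlargement and the filtering in one step.
    y_deconv = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)(x)  # (1, 32, 16, 16)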

  3. This paper proposes to use a dataset of 3D models to train generative up-convolutional neural networks that can generate 2D projections of objects from high-level descriptions. These networks are capable of knowledge transfer within an object class, knowledge transfer between classes, feature arithmetic, interpolation between different objects within a class and between classes, and randomly generating new object styles. Regarding the network architecture, the outputs are an RGB image x and a segmentation mask s, and the inputs are the class c (the model style), the view v (the horizontal angle and elevation of the camera position), and the transformation parameters theta (additional transformations applied to the image). They try several network architectures, which differ by having one up-convolution more or less. For training, the parameters w are learned by minimizing the error of reconstructing the segmented-out chair image and the segmentation mask. The experiments suggest that networks trained to reconstruct objects of multiple classes develop some understanding of 3D shape and geometry.

  4. In this paper, generative models were built to create objects, including chairs, tables, and cars, from high-level descriptions. The dataset used to train these models consisted of 2D snapshots of 3D models rendered at different viewpoints. The generative models' architecture was approximately an inverse of a standard convolutional network. The input descriptions (class, view, and transf) were passed through fully connected layers and then through a combination of unpooling and convolutional layers in order to generate 64x64, 128x128, or 256x256 images. Two separate losses were used: the actual loss on the image, and a segmentation loss. The generative models trained in this manner showed interesting properties, including the ability to generate transformed images, interpolate between viewpoints, extrapolate new views, interpolate between styles, perform feature-space arithmetic, and generate random objects. The performance was analyzed quantitatively by looking at correspondences between keypoints, comparing against baseline algorithms and human performance. In general, these generative models outperformed baselines, but did not come close to human-level performance. Several visualizations were generated to analyze these networks, including analyzing activations of single units and analyzing activations across entire layers of the network.

    Discussion:
    I’m wondering what some practical applications of these networks would be, or what kind of innovations these models will enable?

    -Jonathan Hohrath

  5. The model in this paper generates pictures of chairs/cars given a few input parameters: chair type (style), elevation and azimuth of the camera, and an image transformation. This is in effect the reverse of a classical conv net for classification. They use a few unpool + conv layers to bring the dimensions up to those of the output image.

    They show that the model learns features rather than memorizing the training set, by showing that interpolation, varying the transformation, elevation, or style in the input, etc., works reasonably well. They also discuss how data augmentation makes the generated images less sharp but improves interpolation and thus generalization, which I found interesting.
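
    The style interpolation mentioned above could be sketched roughly as follows (hypothetical code, assuming a trained generator net(c, v, theta) like the one sketched earlier in this thread):

    import torch

    def interpolate_styles(net, c_a, c_b, v, theta, steps=8):
        # Blend two one-hot style codes and render the generator output at each step.
        frames = []
        for t in torch.linspace(0.0, 1.0, steps):
            c_mix = (1.0 - t) * c_a + t * c_b   # convex combination of the two style codes
            rgb, _segm = net(c_mix, v, theta)
            frames.append(rgb)
        return frames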

  6. In this paper, the authors present an up-convolutional generative network which produces images given 3d model descriptions. The network architecture is very similar to early examples of deep convolutional networks such as AlexNet, but in reverse, with the input connected to fully-connected layers, the output of which is upsampled through unpooling and convolution. The model is capable of generating transformations of images, indicating that it may be learning more interesting semantic details than some of the other generative models we've seen. The authors demonstrate that they can interpolate between representations to produce images with new translation, rotation, etc. In addition, they are able to interpolate between different objects and produce images that share features of both.

    Why did the authors use up-convolutions instead of deconvolutions?

    -Jay DeStories

  7. Summary by Jason Krone:

    The model takes as input one-hot style vectors, a viewpoint, and theta (additional transformations) and produces images and segmentation masks. They use unpooling in which each value is routed to the upper-left corner of its output block and the remaining entries are zeros (see the sketch at the end of this comment). The leaky ReLU activation function is used throughout the network. They experiment with two network architectures; both begin with 5 FC layers and end with 4 uconv layers. The first network generates the image and segmentation mask using separate uconv streams, while the other uses one stream of uconv layers to generate both outputs. The loss consists of two terms: 1) the generated image loss (squared Euclidean distance) and 2) the segmentation map loss (negative log-likelihood or squared Euclidean distance). The network is able to generalize across previously unseen values of parameters such as viewpoint, as well as between styles.

    Questions:
    What are the advantages and disadvantages of this approach compared to GANs?
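
    The fixed unpooling mentioned in the summary can be written in a few lines; this is a minimal sketch of a 2x2 unpooling that routes each value to the upper-left cell of its output block (not the authors' code):

    import torch

    def unpool_upper_left(x):
        n, c, h, w = x.shape
        out = x.new_zeros(n, c, h * 2, w * 2)
        out[:, :, ::2, ::2] = x   # each input value lands in the top-left corner of its 2x2 block
        return out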

  8. This paper demonstrates an upsampling network that can generate varying flat images given a training set of 3-dimensional objects. Generating flat images from a 3-D scene is the fundamental problem of computer graphics: you place a camera and calculate the pixel colors. However, this network is doing a different task. It learns an embedding of the training set, and different samples from this embedding can produce 2-D pictures of objects the network never saw. It can also do things like move the camera around an object.

    It is very similar to papers we've seen on generating faces and images in that walking around the space is a bit disappointing - it doesn't produce realistic pictures. However, it is encouraging to think of this network as a compression of the 3-D space. It can accomplish some of the same things a moving camera might, but without a real notion of the complexity of the object in 3 dimensions.

    Question: Can you review the use of the segmentation mask? I wasn't sure what that was.

  9. This paper investigates how the proposed network is able to generate new images of chairs from a feature vector that defines the chair 'style', a view angle, and a transformation vector.

    One piece of insight that I found to be particularly interesting is the inclusion of the "segmentation" loss. The authors here notice that by forcing the network to also generate a segmentation from a given feature vector, the network encapsulates our intuition that the edges or boundaries of objects are incredibly important.

    The authors also present a more natural and powerful way to investigate the units of a generative network. Instead of zeroing out all but one activation in a layer, they investigate the role of a unit by first generating an image, then modulating the value of only that unit to observe its effect on the generated image as a whole (sketched at the end of this comment).

    Discussion:

    The merit of this paper really does lie in its investigation into the properties of the latent space described by the original feature vector and the roles of each unit in the network. However, I felt that, in comparison, investigation into the following two issues was lacking:

    1. What happens when the network does not incorporate a segmentation stream / loss?

    2. How was the latent space learned? What was the architecture of the encoder? (The latent space seems incredibly important here! Look at the power of simple linear interpolation.)

    More Discussion:
    How a chair's style is encoded seems important. Can we investigate, in a way similar to what is presented in the paper, the role of the units of an autoencoder? Can we similarly first perform a forward pass, modulate activations in selected units, and then inspect the output of the decoder?
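
    A hypothetical sketch of the single-unit probe described above, using a PyTorch forward hook to rescale one hidden unit and regenerate the image (the generator net and its layer handle are assumptions, not the authors' code):

    import torch

    def modulate_unit(net, layer, unit_idx, scale, c, v, theta):
        # Rescale one unit's activation during the forward pass and see how the output changes.
        def hook(module, inputs, output):
            output = output.clone()
            output[:, unit_idx] *= scale
            return output
        handle = layer.register_forward_hook(hook)
        try:
            rgb, _segm = net(c, v, theta)
        finally:
            handle.remove()
        return rgb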



  10. The researchers in the paper train networks that are able to create 2D projections of 3D models given a style (encoded as a model number), viewpoint, and color. The task of the model is the reverse of typical classification tasks: instead of converting an image into a compressed high-level representation, the task is to generate an image from high-level parameters. The researchers found that data augmentation led to blurriness but helped generalization. The trained networks were surprisingly capable of knowledge transfer both within and between classes.

  11. A generative "up-sampling" model is proposed in this paper. This model was trained to generate chairs, tables and cars at different levels of zoom, color, style, and viewing angle by training on a 3D model dataset. The novel idea proposed in this paper is the use of a two-stream network and a learned encoding from the desired parameters to a latent feature space. The upsampling network then applies unpooling + convolutions to the latent-space input to generate an RGB image in one stream and a segmentation mask in the other.

    Similar to an autoencoder paper we saw a week ago, this learnt latent-space encoding ends up being immensely powerful: features in that space can be added or subtracted (see the sketch below).

    Comments + Questions:
    - It seems weird to use an L2 loss when so much of the literature has commented on its tendency to encourage image blurring.
    - The paper mentioned a very interesting method of doing probabilistic generative modeling using an inference network. How does that work?
    - In the correspondence section, they use a morphing of 64 images to generate the optical flow. Is the morphing a direct projection from one latent space vector to another?
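
    A sketch of the feature-arithmetic idea mentioned above; the hidden(...) and render(...) handles are hypothetical names for the FC-layer representation of a trained generator and the remaining layers that map it to an image:

    import torch

    def feature_arithmetic(hidden, render, c_a, c_b, c_c, v, theta):
        # hidden(...) maps inputs to an FC-layer representation; render(...) maps it to an image.
        h_a = hidden(c_a, v, theta)
        h_b = hidden(c_b, v, theta)
        h_c = hidden(c_c, v, theta)
        h_new = h_a - h_b + h_c      # combine attributes directly in feature space
        return render(h_new)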

  12. This paper introduces a generative “up-convolutional” network to generate images of chairs from 3D-model descriptions. It takes as input the style, orientation, and artificial transform of the chair to generate, and outputs an RGB image and a segmentation of the chair. They define the operation of up-convolution as an un-pooling layer followed by, or performed simultaneously with, a convolutional layer.

    Discussion:
    I don’t understand the purpose of generating a segmentation mask if the RGB image already outputs the shape of the chair.

  13. This paper investigates the problem of generating 2-D images by learning a neural net. Usually the problem is viewed as a two-part approach: learning the distribution of the generative model and learning a method of generating from that distribution.
    The proposed algorithm in this paper uses an up-convolutional network (roughly a reversed CNN) to combine different high-level descriptions of an image, such as class, view, angle, etc., into a 2-D rendering of the model.

    This algorithm can interpolate between different classes, different angles, and descriptions within a single class to generate novel pictures.

    Discussion: There appears to be little math or analysis of the net. Have you seen anything in other papers?
    In equation 1, I don't get the idea behind setting up the loss function. Could you please go over it and explain how lambda is chosen?
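
    For reference, a combined objective of the kind asked about can be sketched as follows; the weight lambda simply balances the image term against the segmentation term (its value here, and the use of binary cross-entropy standing in for the mask term, are illustrative assumptions):

    import torch
    import torch.nn.functional as F

    def generation_loss(rgb_pred, rgb_target, segm_pred, segm_target, lam=0.1):
        image_term = F.mse_loss(rgb_pred, rgb_target)               # squared Euclidean distance on the RGB image
        segm_term = F.binary_cross_entropy(segm_pred, segm_target)  # stands in for the segmentation NLL term
        return image_term + lam * segm_term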

  14. This paper investigates a form of supervised generative network. This is unusual, as generative models are generally used in the unsupervised case. It is made possible by using 2D renderings of 3D models as training data.

    The authors then investigated the learned hidden state of the model showing that it had many interesting and reasonable properties.

    I'm curious about the shape of their network. As we saw in the previous paper, most generative networks learn a very narrow representation of the world (an encoder-decoder scheme). This model seems to have a number of fully connected layers but no narrowing of the space. What is the effect of the size of the hidden state on the generative properties of the network?

    Another question is how this could be applied to generating meaningful context for the images, e.g. a non-white background.

  15. Sam Burck:

    This paper introduces a generative CNN that looks like a classification CNN in reverse, except that two sequences of convolutional layers emerge from a single fully connected layer. The role of the two different sequences of convolutional layers is to allow for two different loss functions: one loss for color, and one loss for segmentation. The input takes the form of 3 vectors containing information related to class, viewpoint, and an "additional transformation" vector, which contains information such as "color, brightness, saturation, zoom, etc." Using this architecture, the authors were able to train a network that was not only capable of creating convincing images, but was also able to generalize to data unseen during training. One big takeaway from this paper is that the network is capable of "developing some understanding of 3d shape and geometry."

    Questions:

    How is this network learning features related to 3d space? Are there visualizations we could use to take a peek?

  16. "Learning to Generate Chairs, Tables and Cars with Convolutional Networks" introduces a new generative CNN.

    These networks are trained to solve the task of generating chairs, tables and cars using high-level descriptors of the image. From those descriptors, they try to reconstruct the original image using a series of up-convolutions. Their CNNs are basically classical CNNs (convolution, then fully connected) upside-down. A second stream is added to the networks so that they also output a segmentation mask. Training is then performed using a loss that combines a squared Euclidean distance on the RGB image with a segmentation term, for which they tried both a squared Euclidean distance and a negative log-likelihood.

    Once the network is trained, they performed several experiments to evaluate it. These experiments range from evaluating the network's capacity to transfer knowledge to points of view or angles never seen before and to interpolate between different styles, to analyzing how the learned weights and activations respond to different inputs.

    Question: In previous papers, we have seen how the Euclidean loss performs badly for generation tasks; specifically, it tends to produce blurry results. In this case, it seems to be working quite well. Is it because the images are simple 3D renders and not real-world images? Have the authors tried more complex loss functions?

  17. This paper talks about the architecture and experimental results of a network which generates images and their segmentation maps based on inputs identifying object class, angle and elevation, and other transformations. The result is a network which can generate images of chairs, tables, and cars given these parameters. The paper goes on to demonstrate evidence that the model has captured semantic understanding of the various objects, by visualizing various transformations to the model, interpolating viewing angles of the objects, and visualizing the transitions from one object to another, both intra-class and inter-class. It also demonstrates an ability to perform semantic arithmetic, as we’ve seen before, such as performing “A – B + C” where A, B, and C are chairs and the result is a chair which has features from A and C, without the features common to A and B. It goes on to visualize the similarities of keypoints of chairs, i.e. the top right corner of the back of the chair, the bottom of the front left leg, etc., to their representation in semantic space. The paper also helps visualize the semantic representations of the images by activating single neurons, as we’ve seen before, and by partially masking the FC layers and visualizing the result.

    Discussion: In the unpooling layers, why is it better to activate one pixel and leave the rest as zero, as opposed to bilinear upsampling or deconvolution? What could be a benefit to that method?

    Ben Papp
