Sunday, April 9, 2017

Tues. April 11: Using very deep autoencoders for content-based image retrieval.

Krizhevsky, Alex, and Geoffrey E. Hinton. "Using very deep autoencoders for content-based image retrieval." ESANN, 2011.

18 comments:

  1. Nathan Watts’ summary
    This paper proposes a method of hashing and retrieving images based on a compressed vector representation of image features and semantics. Images can then be retrieved very efficiently and stored in such a way that similar images have similar hashes. The features are learned in an unsupervised manner using deep belief networks, built as a stack of Restricted Boltzmann Machines, each of which further compresses the representation produced by the previous one. The weights from the deep belief network are then used to initialize an autoencoder, which is fine-tuned to reduce the RMS reconstruction error. To obtain binary codes, the outputs of the logistic code units are rounded to binary values during the forward pass. They trained two of these autoencoders, one with a 28-bit code and one with a 256-bit code for greater precision, and found that an ensemble of the two gave the best results, since they operate at different levels of abstraction. Semantic hashing with these codes provides much better retrieval than nearest-neighbor search.
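
    A minimal sketch of that rounding trick, assuming a straight-through gradient (PyTorch here; the function name and the exact gradient treatment are my own illustration, not the paper's):

```python
import torch

def binarize_st(x):
    # Forward: round logistic outputs to exactly 0/1.
    # Backward: detach() hides the rounding from autograd, so gradients flow
    # as if the rounding were the identity (a straight-through estimator;
    # the paper's exact gradient treatment may differ).
    return x + (x.round() - x).detach()

logits = torch.randn(4, 28, requires_grad=True)  # 4 hypothetical 28-unit codes
code = binarize_st(torch.sigmoid(logits))        # entries are exactly 0.0 or 1.0
code.sum().backward()                            # gradients still reach `logits`
```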

    Questions: Does initializing an autoencoder with the RBM weights mean simply changing the loss function, or is there more involved?

  2. This paper discusses a new method for image retrieval. Specifically, the model uses deep autoencoders to map images to short binary codes. Then, using semantic hashing, these 28-bit codes can be used to retrieve images that are similar to a query image. The nature of the encoding allows billions of images to be searched in very little time (milliseconds). To learn the codes, the paper proposes a method using Deep Belief Networks.
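
    To make the millisecond-scale lookup concrete, here is a toy semantic-hashing index in Python (all names are illustrative, not from the paper): each 28-bit code is an integer key into a hash table, and a query probes every code within a small Hamming ball:

```python
from collections import defaultdict
from itertools import combinations

# Toy index: each 28-bit code (stored as a Python int) maps to the ids of
# the images that encode to it.
index = defaultdict(list)

def add_image(code, image_id):
    index[code].append(image_id)

def query(code, radius=2, n_bits=28):
    """Return image ids whose codes are within `radius` bit flips of `code`."""
    hits = list(index.get(code, []))
    for r in range(1, radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for p in positions:
                flipped ^= 1 << p          # flip one bit of the query code
            hits.extend(index.get(flipped, []))
    return hits

add_image(0b0000111100001111000011110000, "img_42")
print(query(0b0000111100001111000011110001))   # finds img_42 at distance 1
```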

    To determine the image code, the paper normalizes the image data. In the future, do you need to normalize a new image before querying for similar images?

    How can you reconstruct an image from a 256-bit code? That seems like a lot of overfitting. Does it work on new images not in the training set?

    -Sam Woolf

  3. Xinmeng’s summary
    The paper proposes autoencoders at two code sizes that learn image features in deep layers, in order to save space and search time when representing an image. The authors evaluate the autoencoders by encoding an image and then reconstructing the original from the code. They compare the resulting model against spectral codes and Euclidean distance, and the approach based on Restricted Boltzmann Machines (RBMs) performs best. Unsurprisingly, the 256-bit encoder reconstructs at higher fidelity than the 28-bit encoder, since it retains more information. Using semantic hashing, the codes can retrieve objects in a time that is independent of the size of the database. The paper also pads images using a "retina" that focuses on the central area rather than treating all spatial locations equally, which reduces computation and improves accuracy. Rather than keeping pixel-wise features, the method keeps only a binary code. As the authors mention, the 256-bit codes can be used to prune the 28-bit candidates by keeping only images whose codes differ from the query's by 5 bits or fewer.
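
    A small sketch of what that pruning step might look like (my own illustration; the 5-bit threshold comes from the summary above):

```python
def hamming(a: int, b: int) -> int:
    # Number of differing bits between two codes stored as ints.
    return bin(a ^ b).count("1")

def prune(query_256, candidates, codes_256, max_dist=5):
    # Keep only the 28-bit candidates whose 256-bit code lies within
    # `max_dist` bit flips of the query's 256-bit code; the threshold
    # should be read as illustrative.
    return [img for img in candidates
            if hamming(query_256, codes_256[img]) <= max_dist]
```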

    Discussion: If the 256-bit codes are used to prune the 28-bit candidates, does running both the 256-bit and the 28-bit autoencoders slow the process down? Why not just use the 256-bit autoencoder?

  4. The authors present a method for compressing images to bit-packed representations. Using 256 bits, they are able to restore the image, although with substantial loss. Using 28 bits, they are unable to restore the image, but they show a few really striking results. First, classifiers may still be able to recognize the image. More importantly, it allows ultra-fast search, with lookup time constant and independent of the number of images. The reduced, compressed code space also has spatial semantics: images that are close together in Manhattan (equivalently, Hamming) distance are similar. They extend this idea to build a very fast, metric-based data structure for image clustering.

    The compression is accomplished using a Deep Belief Network (DBN), which initializes a deep autoencoder that forces the image to be represented by the middle code layer. In particular, the DBN is built from layers of Restricted Boltzmann Machines (RBMs).

    Question:

    For search, they mention that they first build an array with 2^28 elements, each holding a pointer to a bucket of images. With 64-bit pointers, that array alone is 2^28 × 8 bytes = 2 GiB, which is large but storable; the buckets, however, must still hold an entry for every image in the database. They claim they can search an arbitrarily big database in time independent of its size. How does that work? Constant lookup time seems plausible, but the memory grows with the database, and with only 2^28 ≈ 268 million distinct codes the buckets must get crowded for very large collections. Am I interpreting that correctly?
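
    A quick sanity check of those numbers (assuming 64-bit pointers; illustrative only):

```python
# Back-of-the-envelope check of the figures above.
slots = 2 ** 28            # one slot per possible 28-bit code
pointer_bytes = 8          # one 64-bit pointer per slot
print(slots * pointer_bytes / 2 ** 30)   # -> 2.0 (GiB) for the pointer array
```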

  5. The authors cover using deep autoencoders to map color images to binary codes. They used Deep Belief Networks (DBNs) and two autoencoders to create a system that outputs binary codes of 28 bits and 256 bits. Each layer in the DBN compresses the representation provided by the previous layer. The 256-bit representation allowed for more precise image reconstruction, but using the two together was best for image search. Representing images with these short binary codes is very useful for search, since you can transform an image's code with a fixed number of variations and then query a database for those exact new codes to find similar images (so search speed is essentially independent of database size). Compared with spectral codes and Euclidean distance, the deep network codes produced the most accurate results for similar-image search.

    Discussion: They also mention using label information; that seems very interesting, since in searching for similar images you could then weight both visual and semantic similarity as search parameters.

  6. This paper presents a way to retrieve similar images quickly and with little memory, with retrieval time independent of the size of the database. During training, deep autoencoders map small color images to short binary codes, which are then used for semantic hashing; the code representation captures the semantic content of the image. For a query image, a binary code is generated and similar images are retrieved. The model is a deep belief network created by learning a stack of Restricted Boltzmann Machines, followed by fine-tuning of the autoencoder with backpropagation. At test time, results are compared quantitatively and qualitatively: qualitatively, 256-bit deep codes perform much better than Euclidean distance, while 256-bit spectral codes are much worse; quantitatively, 28-bit deep codes perform as well as 256-bit spectral codes.
    Discussion: The paper mentions that autoencoders initialized by DBNs were previously used to obtain short binary codes for documents, and that these codes could be used for semantic hashing but not necessarily for document retrieval. Yet the future-work section says that deep autoencoders pre-trained as DBNs also work well for document retrieval. How do these two statements fit together?

  7. This paper explores a method for encoding images for quick retrieval. The authors took small images and passed them through several Restricted Boltzmann Machines to produce 28- and 256-bit binary codes for each image. The result is that similar images have binary codes that differ by only a few bits. The authors also use semantic hashing for speedier retrieval. They sample several patches from each image and encode these patches into 28 bits, then look up codes similar to the patches' codes in the hash table and score the images found in the lookup. They combine these scores with images found by looking up the 256-bit encoding of the original query image to produce the final list of images.
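
    A toy version of that patch-voting step (the names and the plain vote count are my own illustration): each patch code is looked up in the hash table, and every hit votes for the image it came from:

```python
from collections import Counter

def score_by_patches(patch_codes, index):
    # `patch_codes`: 28-bit codes of the patches sampled from the query image.
    # `index`: hash table mapping a code to the ids of images containing a
    # patch with that code. Every lookup hit casts a vote; images matching
    # many query patches rank highest.
    votes = Counter()
    for code in patch_codes:
        for image_id in index.get(code, []):
            votes[image_id] += 1
    return votes.most_common()

index = {0b1010: ["img_1", "img_7"], 0b1011: ["img_7"]}
print(score_by_patches([0b1010, 0b1011], index))  # img_7 gets two votes
```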

    Discussion:
    How well does this scheme work with larger images, or images of different sizes?
    How did they conclude that the qualitative results of their method were superior to other methods? Was it just by eye?

  8. This paper presents deep autoencoders that, at least qualitatively, seem to encode image features surprisingly well. Their method encodes images into a very small number of bits and allows for very fast search times.

    Discussion:
    There has been a long history of rank reduction and dimensionality reduction with eigenvalue/eigenvector, or spectral, methods. How do we reconcile the authors' deep, stochastic methods with known mathematical results showing that spectral methods are optimal for linear low-rank approximation (e.g., the Eckart-Young theorem)?

  9. Summary by Jason Krone:

    This paper discusses the use of autoencoders for content-based retrieval of images in a database. The authors used Deep Belief Networks (DBNs), stacks of Restricted Boltzmann Machines, to map input images to short binary codes. These DBNs made of stacked RBMs are generative models that learn a variational lower bound on the log probability of the data. The authors use contrastive divergence learning to train the RBMs. In line with current practice, the RBMs were initialized with random weights and zero biases, and trained for 80 epochs with mini-batches of size 128. Squared reconstruction error is used as the loss function for the autoencoder. Interestingly, the authors force the encodings to be binary by rounding the outputs of the logistic units to 1 or 0. This method is 1000 times faster than using Euclidean distance to find similar images in a dataset and also produces superior results.
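
    For concreteness, here is a minimal CD-1 update for a binary RBM in NumPy (biases, momentum, and weight decay are omitted, so this is a sketch rather than the paper's exact recipe):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, lr=0.01, rng=np.random.default_rng(0)):
    """One contrastive-divergence (CD-1) step for a binary RBM."""
    h_prob = sigmoid(v0 @ W)                                    # positive phase
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    v1 = sigmoid(h_sample @ W.T)                                # reconstruction
    h1_prob = sigmoid(v1 @ W)                                   # negative phase
    # Update: difference of data and reconstruction statistics.
    W += lr * (v0.T @ h_prob - v1.T @ h1_prob) / v0.shape[0]
    return W

W = np.random.default_rng(1).normal(0, 0.01, (784, 256))   # visible x hidden
v0 = np.random.default_rng(2).random((128, 784))           # minibatch of 128
W = cd1_update(W, v0)
```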

    Questions:
    - How is content-based image retrieval done currently?
    - What are the “Hamming distance” and “Hamming ball”?

  10. This paper presents an autoencoder that, to me, sounds a lot like a chain of fully connected layers. Each FC layer reduces the size of the representation, and after a few layers they coerce the model to produce a binary hidden state (i.e., the low-dimensional representation). They say this model is a thousand times faster than the comparison model, which uses Euclidean distance. But didn't they use RMS reconstruction error too? They also talk about improving accuracy by using more of the 28-bit codes.

  11. The researchers present a method to map color images to short binary codes using deep autoencoders. Advantages of binary codes include cheap storage and fast bit-wise comparisons. The researchers trained a 28-bit and a 256-bit autoencoder. Using 256-bit binary codes performed better and faster than using Euclidean distance, and using 28-bit binary codes performed just as well as 256-bit spectral codes.

    Question:
    In the future-work section, the authors mention that it would be easy to learn deep binary codes suitable both for reconstructing the image and for predicting a label or caption. Has such work been done since, considering this paper is a few years old?

  12. The following publication presents a methodology for content-based image retrieval that maps small RGB images to binary codes using very deep autoencoders. The authors begin by detailing the advantages of using binary codes for the task. They use Deep Belief Networks (DBNs), built from stacks of Restricted Boltzmann Machines (RBMs), to learn short binary codes. For each minibatch of data vectors they select a binary hidden state vector, reconstruct the visible vectors, recompute the hidden states, and finally update the weights. The autoencoders are fine-tuned using backpropagation to minimize reconstruction error.
    Semantic hashing is applied by training autoencoders on local patches of the image; it maps images to approximate binary codes. The 28-bit semantic hashing codes are used to maximize search performance in the database. This is very useful for content-based image retrieval, and the results are both quantitatively and qualitatively better than other methods.
    Question: How does training on a bag of local patches work? For example, suppose we have an image of a palace that takes up most of the image area. During training we break the image into small segments, learn features from them, and store them as short binary codes. But given a new test image, are we certain it will be broken into exactly the same number of segments, with each segment generating approximately the same set of features, so that the stored binary codes and the newly generated codes will match?

  13. This comment has been removed by the author.

  14. In this paper, a method for autoencoding images is presented that enables extremely fast image retrieval. The idea is to represent images as short codes of either 28-bit or 256-bit length; using these codes, images can be retrieved by semantic hashing. The autoencoder is built from multiple layers of Restricted Boltzmann Machines (RBMs), each containing a visible and a hidden layer. The layer weights are initialized by learning a Deep Belief Network (DBN). DBNs are trained by learning each layer individually, passing activations from the visible layer to the hidden layer and back again until the layer can reproduce its input on the visible units. A 1.6 million image dataset was obtained by querying for non-abstract English nouns. After training, the method was tested by evaluating the fraction of images returned for a query image that were of the same class; the CIFAR-10 dataset was also used for evaluation. The autoencoder codes at both 28 and 256 bits outperformed Euclidean-distance search at 1/1000th of the time. Results were improved by searching on multiple transformations of the query image at the same time. In addition, searching on patches of the images yielded the best results, significantly outperforming the baselines.
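
    A compact sketch of that greedy layer-by-layer procedure in NumPy (layer sizes and hyperparameters here are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_dbn(data, layer_sizes, epochs=5, lr=0.01, batch=128):
    # Greedy layer-wise pretraining: train an RBM on the current
    # representation with CD-1, then feed its hidden activations up as
    # training data for the next RBM. Biases, momentum, and the "retina"
    # preprocessing are omitted for brevity.
    weights, v = [], data
    for n_hidden in layer_sizes:
        W = rng.normal(0, 0.01, (v.shape[1], n_hidden))
        for _ in range(epochs):
            for i in range(0, len(v), batch):
                v0 = v[i:i + batch]
                h = sigmoid(v0 @ W)
                hs = (rng.random(h.shape) < h).astype(float)
                v1 = sigmoid(hs @ W.T)
                W += lr * (v0.T @ h - v1.T @ sigmoid(v1 @ W)) / len(v0)
        weights.append(W)
        v = sigmoid(v @ W)   # this layer's activations feed the next layer
    return weights

weights = pretrain_dbn(rng.random((1024, 3072)), [1024, 256, 28])
```
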
    Discussion:
    Can we walk through what a Restricted Boltzmann Machine is, and how each layer is trained in a deep belief network?

    -Jonathan Hohrath

  15. The authors of this paper designed a method for doing image semantic hashing using deep autoencoders. They developed this as a novel approach to image retrieval.

    The image codes are obtained from an autoencoder composed by stacking multiple Restricted Boltzmann Machines, with the weights initialized by training a Deep Belief Network. The last layer of the autoencoder is rounded so that it forms a valid binary code. Two different autoencoders were trained, one built to produce 28-bit hashes and the other 256-bit hashes. Then a linear search over the full dataset's codes retrieves images with similar codes.
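
    A minimal sketch of that linear code search (shapes and names are illustrative):

```python
import numpy as np

def linear_search(query_code, all_codes, top_k=5):
    # Brute-force variant of the retrieval step described above: Hamming
    # distance between the query's code and every stored code, with codes
    # kept as 0/1 numpy arrays.
    dists = np.count_nonzero(all_codes != query_code, axis=1)
    return np.argsort(dists)[:top_k]   # indices of the closest images

codes = np.random.default_rng(0).integers(0, 2, (1000, 256))  # 1000 stored codes
print(linear_search(codes[3], codes))   # index 3 (distance 0) ranks first
```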

    They evaluated this method both qualitatively and quantitatively, using a custom dataset and the CIFAR-10 dataset, respectively. As an improvement to the model, they designed multiple semantic hashing, in which this process is performed not on the whole image but separately on different patches of the same image.

    Question: What exactly is a Deep Belief Network? Does it affect the model in any way other than the initialization (could we substitute Xavier initialization and still get a working model)?


  16. Jie's summary,

    In this paper, the authors introduce a way of doing image retrieval using very deep autoencoders. This is done with a Deep Belief Network (DBN), a multilayer, stochastic generative model created by learning a stack of Restricted Boltzmann Machines (RBMs).

    Two autoencoders were trained: one compresses the image into a 28-bit representation and the other into a 256-bit representation. They showed that the image can be reconstructed from the 256-bit representation, and that with the 28-bit representation similar pictures still end up close to each other. With multiple semantic hashing, the encoding provides a very efficient way to search for an image in constant time. The results are both qualitatively and quantitatively better than other methods.

    Question: It would be helpful if we could go over Restricted Boltzmann Machines and Deep Belief Networks.

  17. This paper presents an architecture for autoencoding images for semantic hashing. The authors use stacked Restricted Boltzmann Machines to generate a binary 28-bit code for an image, then search over codes with low Hamming distance to find semantically similar images. Their results are qualitatively better (according to the authors) than previous efforts using spectral codes. They account for local features and translation invariance by running multiple patches and transformations through the network for each image, computing scores for matching each patch against sets of images, and summing the scores across patches to find the most similar images.

    Why not use convolutional networks for autoencoding? The authors say that binary codes work better than real-valued codes for image retrieval. Are there other problems in machine vision that could benefit from binary rather than real-valued representations? Since they use probabilities within the network, this seems, at a high level, similar to using batch normalization with a fixed gamma and beta. Has anyone tried building autoencoders using real-valued networks and batch normalization?

    -Jay DeStories

  18. In this paper, the authors propose the use of deep autoencoders for image hashing, reconstruction, and search.
    Unlike convolutional networks, the autoencoder here is initialized from a Deep Belief Network, which is trained layer by layer so that each layer can reconstruct its input; this works because DBNs are formed by stacking RBMs.

    Training involves two steps:
    1) Train the DBN via the standard contrastive divergence procedure, carefully selecting the learning rate and momentum to avoid instability at the start.
    2) Fine-tune the DBN/autoencoder to produce bit codes by rounding the last logistic layer and backpropagating the reconstruction error.

    The proposed models were much better than Euclidean-distance search while also being quite a bit faster. An actual implementation is discussed, using a hash table plus a linear search to do related-image lookup.
