diff --git a/README.md b/README.md
index 136d6152..578b9c61 100755
--- a/README.md
+++ b/README.md
@@ -4,6 +4,8 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho
 
 ## List of papers
 
+* [MONet - Unsupervised Scene Decomposition and Representation](https://shagunsodhani.com/papers-I-read/MONet-Unsupervised-Scene-Decomposition-and-Representation)
+* [Revisiting Fundamentals of Experience Replay](https://shagunsodhani.com/papers-I-read/Revisiting-Fundamentals-of-Experience-Replay)
 * [Deep Reinforcement Learning and the Deadly Triad](https://shagunsodhani.com/papers-I-read/Deep-Reinforcement-Learning-and-the-Deadly-Triad)
 * [Alpha Net: Adaptation with Composition in Classifier Space](https://shagunsodhani.com/papers-I-read/Alpha-Net-Adaptation-with-Composition-in-Classifier-Space)
 * [Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer](https://shagunsodhani.com/papers-I-read/Outrageously-Large-Neural-Networks-The-Sparsely-Gated-Mixture-of-Experts-Layer)
diff --git a/site/_posts/2020-09-14-MONet Unsupervised Scene Decomposition and Representation.md b/site/_posts/2020-09-14-MONet Unsupervised Scene Decomposition and Representation.md
new file mode 100755
index 00000000..54a0a589
--- /dev/null
+++ b/site/_posts/2020-09-14-MONet Unsupervised Scene Decomposition and Representation.md
@@ -0,0 +1,115 @@
+---
+layout: post
+title: MONet - Unsupervised Scene Decomposition and Representation
+comments: True
+excerpt:
+tags: ['2019', 'Object-Oriented Learning', AI, Attention, CV, Unsupervised]
+
+
+---
+
+## Introduction
+
+* The paper introduces the Multi-Object Network (MONet) architecture, which learns a modular representation of images by spatially decomposing scenes into *objects* and learning a representation for each of these *objects*.
+
+* [Link to the paper](https://arxiv.org/abs/1901.11390)
+
+## Architecture
+
+* Two components:
+
+  * Attention module: generates spatial masks corresponding to the *objects* in the scene.
+
+  * VAE: learns a representation for each *object*.
+
+* VAE components:
+
+  * Encoder: takes as input the image and an attention mask generated by the attention module, and produces the parameters of a distribution over the latent variable *z*.
+
+  * Decoder: takes the latent variable *z* as input and attempts to reconstruct the image.
+
+* The decoder loss is weighted by the mask, i.e., the decoder only has to reconstruct the parts of the image that the attention mask focuses on.
+
+* The attention mechanism is auto-regressive, with an ongoing state (called the scope) that tracks which parts of the image have not yet been attended to.
+
+* In the last step, no attention mask is computed, and the previous scope is used as-is. This ensures that the masks sum to 1 over the image (see the sketch after this list).
+
+* The VAE also models the attention masks over the components, i.e., the probability that each pixel belongs to a particular component.
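+
+Below is a minimal PyTorch sketch of this recursive scope update, not the authors' implementation: the attention network is left abstract, and `attention_net` is assumed to map the image and the current log-scope to per-pixel logits.
+
+```python
+import torch
+import torch.nn.functional as F
+
+def decompose(image, attention_net, num_slots):
+    """Compute log-masks for `num_slots` components, MONet-style."""
+    b, _, h, w = image.shape
+    log_scope = torch.zeros(b, 1, h, w, device=image.device)  # log s_0 = log 1
+    log_masks = []
+    for _ in range(num_slots - 1):
+        logits = attention_net(image, log_scope)  # (B, 1, H, W)
+        # m_k = s_k * alpha_k and s_{k+1} = s_k * (1 - alpha_k), in log space
+        log_masks.append(log_scope + F.logsigmoid(logits))
+        log_scope = log_scope + F.logsigmoid(-logits)
+    # the last slot keeps whatever scope remains, so the masks sum to 1
+    log_masks.append(log_scope)
+    return torch.stack(log_masks, dim=1)  # (B, K, 1, H, W)
+
+# usage with a stand-in attention network (the paper uses a U-Net here)
+net = lambda img, scope: torch.randn(img.shape[0], 1, *img.shape[2:])
+masks = decompose(torch.rand(2, 3, 64, 64), net, num_slots=7).exp()
+assert torch.allclose(masks.sum(dim=1), torch.ones(2, 1, 64, 64))
+```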
+
+## Motivation
+
+* A model can process compositional visual scenes efficiently if it can exploit the recurring structure in those scenes.
+
+* The paper validates this hypothesis by showing that an autoencoder performs better when it builds up the scene compositionally, processing one ground-truth spatial mask at a time, rather than processing the whole scene at once.
+
+## Results
+
+* The VAE encoder parameterizes a diagonal Gaussian latent posterior, and a spatial broadcast decoder encourages the VAE to learn disentangled features (a sketch of such a decoder appears at the end of this post).
+
+* MONet with seven slots is trained on the *Objects Room* dataset with 1-3 objects.
+
+  * It learns to generate different attention masks for different objects.
+
+  * Combining the reconstructed components using the corresponding attention masks produces a good-quality reconstruction of the entire scene.
+
+  * Since the model is autoregressive, MONet can be evaluated with more slots than it was trained with, and it generalizes to novel scene configurations not seen during training.
+
+* On the Multi-dSprites dataset (a modification of the dSprites dataset), the trained model distinguishes individual sprites from the background.
+
+* On the CLEVR dataset (2-10 objects per image), the model produces good segmentations and reconstructions and can distinguish between overlapping shapes.
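+
+As a companion sketch (again an illustrative implementation, not the paper's exact configuration), a spatial broadcast decoder tiles the latent *z* across a spatial grid and appends fixed coordinate channels, so position information enters only through the coordinates; the layer sizes and the four output channels (RGB means plus mask logits) below are assumptions.
+
+```python
+import torch
+import torch.nn as nn
+
+class SpatialBroadcastDecoder(nn.Module):
+    """Tile z over a spatial grid, append (x, y) coordinates, decode with a CNN."""
+
+    def __init__(self, latent_dim=16, out_channels=4, size=64):
+        super().__init__()
+        self.size = size
+        axis = torch.linspace(-1, 1, size)
+        gy, gx = torch.meshgrid(axis, axis, indexing="ij")
+        self.register_buffer("coords", torch.stack([gx, gy]))  # (2, H, W)
+        self.net = nn.Sequential(  # illustrative conv stack, not the paper's exact sizes
+            nn.Conv2d(latent_dim + 2, 32, 3, padding=1), nn.ReLU(),
+            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
+            nn.Conv2d(32, out_channels, 1),
+        )
+
+    def forward(self, z):  # z: (B, latent_dim)
+        b, d = z.shape
+        zb = z.view(b, d, 1, 1).expand(b, d, self.size, self.size)
+        coords = self.coords.unsqueeze(0).expand(b, -1, -1, -1)
+        return self.net(torch.cat([zb, coords], dim=1))  # (B, C, H, W)
+
+out = SpatialBroadcastDecoder()(torch.randn(2, 16))  # shape (2, 4, 64, 64)
+```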