
Commit

Add MONet paper
shagunsodhani committed Nov 25, 2020
1 parent 6ca1430 commit c60618a
Showing 2 changed files with 61 additions and 0 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -4,6 +4,8 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho

## List of papers

* [MONet - Unsupervised Scene Decomposition and Representation](https://shagunsodhani.com/papers-I-read/MONet-Unsupervised-Scene-Decomposition-and-Representation)
* [Revisiting Fundamentals of Experience Replay](https://shagunsodhani.com/papers-I-read/Revisiting-Fundamentals-of-Experience-Replay)
* [Deep Reinforcement Learning and the Deadly Triad](https://shagunsodhani.com/papers-I-read/Deep-Reinforcement-Learning-and-the-Deadly-Triad)
* [Alpha Net: Adaptation with Composition in Classifier Space](https://shagunsodhani.com/papers-I-read/Alpha-Net-Adaptation-with-Composition-in-Classifier-Space)
* [Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer](https://shagunsodhani.com/papers-I-read/Outrageously-Large-Neural-Networks-The-Sparsely-Gated-Mixture-of-Experts-Layer)
@@ -0,0 +1,59 @@
---
layout: post
title: MONet - Unsupervised Scene Decomposition and Representation
comments: True
excerpt:
tags: ['2019', 'Object-Oriented Learning', AI, Attention, CV, Unsupervised]


---

## Introduction

* The paper introduces the Multi-Object Network (MONet) architecture, which learns a modular representation of images by spatially decomposing scenes into *objects* and learning a representation for each of these *objects*.

* [Link to the paper](https://arxiv.org/abs/1901.11390)

## Architecture

* Two components:

* Attention Module: generates spatial masks corresponding to the *objects* in the scene.

  * VAE: learns a representation for each *object*.

* VAE components:

  * Encoder: It takes as input the image and the attention mask generated by the attention module and produces the parameters of a distribution over the latent variable *z*.

* Decoder: It takes as input the latent variable *z* and attempts to reproduce the image.

  * The decoder loss term is weighted by the mask, i.e., the decoder only tries to reproduce those parts of the image that the attention mask focuses on.

* The attention mechanism is auto-regressive, with an ongoing state (called a scope) that tracks which parts of the image have not yet been attended to.

* In the last step, no attention mask is computed, and the previous scope is used as-is. This ensures that all the masks sum to 1 (see the sketch after this list).

* The VAE also models the attention masks over the components, i.e., the probability that a pixel belongs to a particular component.
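
The forward pass described above can be summarized in a short PyTorch-style sketch. The module names (`attention_net`, `component_vae`), their interfaces, and the simplified squared-error reconstruction term are assumptions made for illustration; the paper's actual networks, likelihood, and KL terms differ and are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MONetSketch(nn.Module):
    """Illustrative sketch: recursive attention masks + a per-component VAE."""

    def __init__(self, attention_net: nn.Module, component_vae: nn.Module, num_slots: int):
        super().__init__()
        self.attention_net = attention_net  # assumed: (image, log_scope) -> logits ("explained now" vs. "left for later")
        self.component_vae = component_vae  # assumed: (image, log_mask) -> (reconstruction, kl)
        self.num_slots = num_slots

    def forward(self, image):
        batch, _, height, width = image.shape
        # The scope starts at log(1) = 0: the whole image is still unexplained.
        log_scope = torch.zeros(batch, 1, height, width, device=image.device)
        log_masks, recons, kls = [], [], []

        for k in range(self.num_slots):
            if k < self.num_slots - 1:
                # Split the current scope into this slot's mask and the remaining scope.
                logits = self.attention_net(image, log_scope)
                log_mask = log_scope + F.logsigmoid(logits)
                log_scope = log_scope + F.logsigmoid(-logits)
            else:
                # The last slot takes whatever scope is left, so the masks sum to 1.
                log_mask = log_scope
            recon, kl = self.component_vae(image, log_mask)
            log_masks.append(log_mask)
            recons.append(recon)
            kls.append(kl)

        # Mask-weighted reconstruction: each component is penalised only on the
        # pixels its mask attends to (simplified squared error here, rather than
        # the paper's likelihood).
        recon_loss = sum(
            (log_mask.exp() * (recon - image) ** 2).sum(dim=[1, 2, 3]).mean()
            for log_mask, recon in zip(log_masks, recons)
        )
        return recon_loss + sum(kls), log_masks, recons
```

Because each step multiplies the remaining scope by either `sigmoid(logits)` or `1 - sigmoid(logits)`, and the final slot consumes whatever scope is left, the masks are guaranteed to sum to 1 across the slots.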

## Motivation

* A model could process compositional visual scenes more efficiently if it could exploit the recurring structure in such scenes.

* The paper validates this hypothesis by showing that an autoencoder performs better if it builds up the scene compositionally, processing one mask at a time (using ground-truth spatial masks), rather than processing the whole scene at once.

## Results

* The VAE encoder parameterizes a diagonal Gaussian latent posterior, and the decoder is a spatial broadcast decoder, which encourages the VAE to learn disentangled features (see the sketch after this list).

* MONet with seven slots is trained on the *Objects Room* dataset with 1-3 objects.

* It learns to generate different attention masks for different objects.

* Combining the reconstructed components using the corresponding attention masks produces a good-quality reconstruction of the entire scene.

* Since the attention mechanism is autoregressive, MONet can be evaluated with more slots than it was trained with, and the model generalizes to novel scene configurations (not seen during training).

* On the Multi-dSprites dataset (a modification of the dSprites dataset), the trained model distinguishes individual sprites from the background.

* On the CLEVR dataset (2-10 objects per image), the model produces good segmentations and reconstructions and can distinguish between overlapping shapes.
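
A minimal sketch of a spatial broadcast decoder, assuming the usual formulation (tile the latent over a fixed spatial grid, append coordinate channels, and run a small CNN at full resolution); the layer sizes here are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn


class SpatialBroadcastDecoder(nn.Module):
    def __init__(self, latent_dim: int, out_channels: int, height: int, width: int):
        super().__init__()
        self.height, self.width = height, width
        # Illustrative layer sizes; no upsampling layers are used.
        self.net = nn.Sequential(
            nn.Conv2d(latent_dim + 2, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, z):
        batch, latent_dim = z.shape
        # Broadcast (tile) the latent vector over every spatial location.
        z_tiled = z.view(batch, latent_dim, 1, 1).expand(-1, -1, self.height, self.width)
        # Fixed coordinate channels give the convolutions positional information.
        ys = torch.linspace(-1, 1, self.height, device=z.device)
        xs = torch.linspace(-1, 1, self.width, device=z.device)
        grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([grid_y, grid_x]).expand(batch, -1, -1, -1)
        return self.net(torch.cat([z_tiled, coords], dim=1))
```

In MONet, each component's decoder output includes both the reconstructed pixels and the logits for the reconstructed mask, consistent with the earlier point that the VAE also models the attention masks.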
