
Commit

Add MONet paper
shagunsodhani committed Nov 25, 2020
1 parent 6ca1430 commit c60618a
Showing 2 changed files with 61 additions and 0 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -4,6 +4,8 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho

## List of papers

* [MONet - Unsupervised Scene Decomposition and Representation](https://shagunsodhani.com/papers-I-read/MONet-Unsupervised-Scene-Decomposition-and-Representation)
* [Revisiting Fundamentals of Experience Replay](https://shagunsodhani.com/papers-I-read/Revisiting-Fundamentals-of-Experience-Replay)
* [Deep Reinforcement Learning and the Deadly Triad](https://shagunsodhani.com/papers-I-read/Deep-Reinforcement-Learning-and-the-Deadly-Triad)
* [Alpha Net: Adaptation with Composition in Classifier Space](https://shagunsodhani.com/papers-I-read/Alpha-Net-Adaptation-with-Composition-in-Classifier-Space)
* [Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer](https://shagunsodhani.com/papers-I-read/Outrageously-Large-Neural-Networks-The-Sparsely-Gated-Mixture-of-Experts-Layer)
@@ -0,0 +1,59 @@
---
layout: post
title: MONet - Unsupervised Scene Decomposition and Representation
comments: True
excerpt:
tags: ['2019', 'Object-Oriented Learning', AI, Attention, CV, Unsupervised]


---

## Introduction

* The paper introduces the Multi-Object Network (MONet) architecture, which learns a modular representation of images by spatially decomposing scenes into *objects* and learning a representation for each of these *objects*.

* [Link to the paper](https://arxiv.org/abs/1901.11390)

## Architecture

* Two components:

* Attention Module: generates spatial masks corresponding to the *objects* in the scene.

  * VAE: learns a representation for each *object*.

* VAE components:

  * Encoder: It takes as input the image and the attention mask generated by the attention module and produces the parameters of a distribution over the latent variable *z*.

* Decoder: It takes as input the latent variable *z* and attempts to reproduce the image.

  * The decoder loss term is weighted by the mask, i.e., the decoder only tries to reproduce those parts of the image that the attention mask focuses on.

* The attention mechanism is auto-regressive, with an ongoing state (called a scope) that tracks which parts of the image have not yet been attended to.

* In the last step, no attention mask is computed, and the previous scope is used as-is. This ensures that all the masks sum to 1 (see the sketch after this list).

* The VAE also models the attention masks over the components, i.e., the probability that a pixel belongs to a particular component.
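
The forward pass described above can be summarized in a short PyTorch-style sketch. The module names (`attention_net`, `component_vae`), their interfaces, and the simplified squared-error reconstruction term are assumptions made for illustration; the paper's actual networks, likelihood, and KL terms differ and are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MONetSketch(nn.Module):
    """Illustrative sketch: recursive attention masks + a per-component VAE."""

    def __init__(self, attention_net: nn.Module, component_vae: nn.Module, num_slots: int):
        super().__init__()
        self.attention_net = attention_net  # assumed: (image, log_scope) -> logits ("explained now" vs. "left for later")
        self.component_vae = component_vae  # assumed: (image, log_mask) -> (reconstruction, kl)
        self.num_slots = num_slots

    def forward(self, image):
        batch, _, height, width = image.shape
        # The scope starts at log(1) = 0: the whole image is still unexplained.
        log_scope = torch.zeros(batch, 1, height, width, device=image.device)
        log_masks, recons, kls = [], [], []

        for k in range(self.num_slots):
            if k < self.num_slots - 1:
                # Split the current scope into this slot's mask and the remaining scope.
                logits = self.attention_net(image, log_scope)
                log_mask = log_scope + F.logsigmoid(logits)
                log_scope = log_scope + F.logsigmoid(-logits)
            else:
                # The last slot takes whatever scope is left, so the masks sum to 1.
                log_mask = log_scope
            recon, kl = self.component_vae(image, log_mask)
            log_masks.append(log_mask)
            recons.append(recon)
            kls.append(kl)

        # Mask-weighted reconstruction: each component is penalised only on the
        # pixels its mask attends to (simplified squared error here, rather than
        # the paper's likelihood).
        recon_loss = sum(
            (log_mask.exp() * (recon - image) ** 2).sum(dim=[1, 2, 3]).mean()
            for log_mask, recon in zip(log_masks, recons)
        )
        return recon_loss + sum(kls), log_masks, recons
```

Because each step multiplies the remaining scope by either `sigmoid(logits)` or `1 - sigmoid(logits)`, and the final slot consumes whatever scope is left, the masks are guaranteed to sum to 1 across the slots.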

## Motivation

* A model could process compositional visual scenes more efficiently if it could exploit the recurring structure in such scenes.

* The paper validates this hypothesis by showing that an autoencoder performs better if it builds up the scene compositionally, processing one mask at a time (using ground-truth spatial masks), rather than processing the whole scene at once.

## Results

* The VAE encoder parameterizes a diagonal Gaussian latent posterior, and the decoder is a spatial broadcast decoder, which encourages the VAE to learn disentangled features (see the sketch after this list).

* MONet with seven slots is trained on the *Objects Room* dataset with 1-3 objects.

* It learns to generate different attention masks for different objects.

* Combining the reconstructed components using the corresponding attention masks produces a good-quality reconstruction of the entire scene.

* Since the attention mechanism is autoregressive, MONet can be evaluated with more slots than it was trained with, and the model generalizes to novel scene configurations (not seen during training).

* On the Multi-dSprites dataset (a modification of the dSprites dataset), the trained model distinguishes individual sprites from the background.

* On the CLEVR dataset (2-10 objects per image), the model produces good segmentations and reconstructions and can distinguish between overlapping shapes.
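
A minimal sketch of a spatial broadcast decoder, assuming the usual formulation (tile the latent over a fixed spatial grid, append coordinate channels, and run a small CNN at full resolution); the layer sizes here are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn


class SpatialBroadcastDecoder(nn.Module):
    def __init__(self, latent_dim: int, out_channels: int, height: int, width: int):
        super().__init__()
        self.height, self.width = height, width
        # Illustrative layer sizes; no upsampling layers are used.
        self.net = nn.Sequential(
            nn.Conv2d(latent_dim + 2, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, z):
        batch, latent_dim = z.shape
        # Broadcast (tile) the latent vector over every spatial location.
        z_tiled = z.view(batch, latent_dim, 1, 1).expand(-1, -1, self.height, self.width)
        # Fixed coordinate channels give the convolutions positional information.
        ys = torch.linspace(-1, 1, self.height, device=z.device)
        xs = torch.linspace(-1, 1, self.width, device=z.device)
        grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([grid_y, grid_x]).expand(batch, -1, -1, -1)
        return self.net(torch.cat([z_tiled, coords], dim=1))
```

In MONet, each component's decoder output includes both the reconstructed pixels and the logits for the reconstructed mask, consistent with the earlier point that the VAE also models the attention masks.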
