---
layout: post
title: MONet - Unsupervised Scene Decomposition and Representation
comments: True
excerpt:
tags: ['2019', 'Object-Oriented Learning', AI, Attention, CV, Unsupervised]
---

## Introduction

* The paper introduces the Multi-Object Network (MONet), an architecture that learns a modular representation of images by spatially decomposing scenes into *objects* and learning a representation for each of these *objects*.

* [Link to the paper](https://arxiv.org/abs/1901.11390)

## Architecture

* MONet has two components:

    * Attention Module: generates spatial masks corresponding to the *objects* in the scene.

    * VAE: learns a representation for each *object*.

* VAE components:

    * Encoder: takes as input the image and an attention mask generated by the attention module, and produces the parameters of a distribution over the latent variable *z*.

    * Decoder: takes the latent variable *z* as input and attempts to reconstruct the image.

* The decoder's loss term is weighted by the mask, i.e., the decoder only has to reconstruct the parts of the image that the attention mask focuses on.

* The attention mechanism is auto-regressive, with an ongoing state (called the *scope*) that tracks which parts of the image have not yet been attended to (see the sketch at the end of this section).

* In the last step, no attention mask is computed; instead, the previous scope is used as-is. This ensures that the masks sum to 1 across all components.

* The VAE also models the attention masks over the components, i.e., the probability that each pixel belongs to a particular component.

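To make the recursion concrete, here is a minimal PyTorch sketch of the decomposition and the mask-weighted reconstruction term. The `attention_net`, `encoder`, and `decoder` modules are hypothetical stand-ins (the paper uses a U-Net attention network and a component VAE); squared error stands in for the paper's pixel-wise likelihood, and the KL terms are omitted:

```python
import torch
import torch.nn.functional as F

def monet_forward(image, attention_net, encoder, decoder, num_slots):
    """Sketch of MONet's recursive decomposition and mask-weighted loss.

    `attention_net`, `encoder`, and `decoder` are hypothetical stand-ins:
      attention_net(image, log_scope) -> per-pixel logits for alpha_k
      encoder(image, log_mask)        -> (mu, log_var) of q(z_k | x, m_k)
      decoder(z)                      -> reconstructed image for slot k
    """
    b, _, h, w = image.shape
    log_scope = torch.zeros(b, 1, h, w, device=image.device)  # s_0 = 1
    log_masks = []
    for _ in range(num_slots - 1):
        logits = attention_net(image, log_scope)
        # m_k = s_{k-1} * alpha_k, computed in log space for stability
        log_masks.append(log_scope + F.logsigmoid(logits))
        # s_k = s_{k-1} * (1 - alpha_k): the part still unexplained
        log_scope = log_scope + F.logsigmoid(-logits)
    # The final slot keeps whatever scope remains, so all masks sum to 1.
    log_masks.append(log_scope)

    recon_loss = torch.zeros((), device=image.device)
    for log_mask in log_masks:
        mu, log_var = encoder(image, log_mask)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterize
        recon = decoder(z)
        # Each slot is only penalized on the pixels its mask attends to.
        recon_loss = recon_loss + (log_mask.exp() * (recon - image) ** 2).sum()
    return recon_loss
```
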
## Motivation

* A model could process compositional visual scenes more efficiently if it can exploit the recurring structure in such scenes.

* The paper validates this hypothesis by showing that an autoencoder performs better when it builds up a scene compositionally, processing one ground-truth spatial mask at a time, than when it processes the entire scene at once.

## Results

* The VAE encoder parameterizes a diagonal Gaussian latent posterior, and the spatial broadcast decoder encourages the VAE to learn disentangled features (see the sketch at the end of these notes).

* MONet with seven slots is trained on the *Objects Room* dataset with 1-3 objects.

* It learns to generate a different attention mask for each object.

* Combining the reconstructed components using the corresponding attention masks produces a good-quality reconstruction of the entire scene.

* Since the model is auto-regressive, MONet can be evaluated with more slots than it was trained with, and it generalizes to novel scene configurations not seen during training.

* On the Multi-dSprites dataset (a modification of the dSprites dataset), the trained model distinguishes individual sprites from the background.

* On the CLEVR dataset (2-10 objects per image), the model produces good segmentations and reconstructions and can distinguish between overlapping shapes.
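
As referenced above, a spatial broadcast decoder tiles the latent vector across the spatial grid and appends fixed coordinate channels before a small convolutional stack, which biases the decoder toward position-aware, disentangled representations. A minimal sketch, with illustrative (assumed) layer sizes:

```python
import torch
import torch.nn as nn

class SpatialBroadcastDecoder(nn.Module):
    """Tile z over an (h, w) grid, append coordinate channels, then convolve."""

    def __init__(self, z_dim, out_channels, h, w):
        super().__init__()
        self.h, self.w = h, w
        ys = torch.linspace(-1.0, 1.0, h)
        xs = torch.linspace(-1.0, 1.0, w)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        self.register_buffer("coords", torch.stack([gx, gy]))  # (2, h, w)
        # Layer sizes below are illustrative, not the paper's exact choices.
        self.net = nn.Sequential(
            nn.Conv2d(z_dim + 2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_channels, 3, padding=1),
        )

    def forward(self, z):  # z: (batch, z_dim)
        b, z_dim = z.shape
        # Broadcasting z to every location means position must come from the
        # coordinate channels, which encourages disentangled features.
        tiled = z.view(b, z_dim, 1, 1).expand(b, z_dim, self.h, self.w)
        coords = self.coords.unsqueeze(0).expand(b, 2, self.h, self.w)
        return self.net(torch.cat([tiled, coords], dim=1))
```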