Commit 7216d80 (1 parent: 8191a01). Showing 4 changed files with 106 additions and 1 deletion.
59 changes: 59 additions & 0 deletions
site/_posts/2018-01-14-Exploring Models and Data for Image Question Answering.md
@@ -0,0 +1,59 @@
---
layout: post
title: Exploring Models and Data for Image Question Answering
comments: True
excerpt: Given an image, answer a given question about the image.
tags: ['2015', 'NIPS 2015', AI, CV, Dataset, NIPS, NLP, VQA]
---

## Introduction

* **Problem Statement**: Given an image, answer a given question about the image.

* [Link to the paper](https://arxiv.org/abs/1505.02074)

* **Assumptions**:
    * The answer is assumed to be a single word, thereby bypassing the evaluation issues of multi-word generation tasks.

## VIS+LSTM Model

* Treat the input image as the first word in the question.
* Obtain the vector representation (skip-gram) for each word in the question.
* Obtain the VGG Net embedding of the image and use a linear transformation (a dimensionality-reduction weight matrix) to match the dimension of the word embeddings.
* Keep the image embedding frozen during training and use an LSTM to combine the image and word vectors.
* The LSTM outputs are fed into a softmax layer, which generates the answer.

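The "image as first word" idea can be sketched in a few lines. This is only an illustration under assumed toy sizes (6-dimensional image features, 3-dimensional word vectors, random values); the helper names `linear_project` and `build_input_sequence` are hypothetical, and the LSTM and softmax layers are left out.

```python
import random

def linear_project(weights, vector):
    """Multiply a (d_out x d_in) weight matrix by a d_in-dimensional vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def build_input_sequence(image_features, question_tokens, word_vectors, projection):
    """Return the sequence an LSTM would consume: projected image first, then words."""
    image_as_word = linear_project(projection, image_features)
    return [image_as_word] + [word_vectors[token] for token in question_tokens]

# Toy stand-ins (assumptions): real CNN features and skip-gram vectors are far larger.
rng = random.Random(0)
question = ["what", "is", "on", "the", "table"]
image_features = [rng.random() for _ in range(6)]
projection = [[rng.uniform(-0.1, 0.1) for _ in range(6)] for _ in range(3)]
word_vectors = {word: [rng.random() for _ in range(3)] for word in question}

sequence = build_input_sequence(image_features, question, word_vectors, projection)
print(len(sequence), len(sequence[0]))  # 6 timesteps, each a 3-d vector
```

Because the projected image has the same dimension as a word vector, the recurrent layer needs no special handling for it.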
## Dataset

* DAtaset for QUestion Answering on Real-world images (DAQUAR)
    * 1300 images and 7000 questions with 37 object classes.
    * The downside is that even guesswork can yield good results.
* The paper proposes an algorithm for generating questions from the MS-COCO dataset.
    * Preprocessing steps include breaking long sentences into shorter ones and changing indefinite determiners to definite ones.
    * *Object* questions, *number* questions, *colour* questions and *location* questions can be generated by searching for nouns, numbers, colours and prepositions respectively.
    * The resulting dataset has ~120K questions across the above 4 semantic types.

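To make the keyword-search idea concrete, here is a toy sketch for *colour* and *number* questions only. It assumes the input is a one-sentence image caption and that the word right after a colour or number is the noun it describes; the word lists, question templates and the `generate_questions` helper are all hypothetical, and a real pipeline would need proper syntactic analysis.

```python
# Tiny hand-written vocabularies (assumptions, for illustration only).
COLOURS = {"red", "green", "blue", "white", "black", "yellow"}
NUMBERS = {"two", "three", "four", "five"}

def generate_questions(caption):
    """Return (question, answer, type) triples found by simple keyword search."""
    words = caption.lower().rstrip(".").split()
    triples = []
    for i, word in enumerate(words[:-1]):
        noun = words[i + 1]  # assumption: the next word is the noun being described
        if word in COLOURS:
            triples.append((f"what is the color of the {noun} ?", word, "colour"))
        elif word in NUMBERS:
            triples.append((f"how many {noun} are there ?", word, "number"))
    return triples

for triple in generate_questions("Two dogs sit beside a red bench."):
    print(triple)
```

Each answer is a single word taken from the caption, which matches the single-word answer assumption of the task.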
## Models

* VIS+LSTM - explained above.
* 2-VIS+BLSTM - Add the image features twice, at the beginning and at the end (using different linear transformations), and use a bidirectional LSTM.
* IMG+BOW - Multinomial logistic regression on image features (without dimensionality reduction) plus a bag of words (averaged word vectors).
* FULL - Simple average of the three models above.

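Two of the operations mentioned above are plain averages, so a small sketch may help: averaging word vectors into a bag-of-words feature, and averaging the per-answer probabilities of several models. The helper names, the toy vectors and the three example probability lists are assumptions made for illustration, not values from the paper.

```python
def bag_of_words(tokens, word_vectors):
    """Average the word vectors of a question into one fixed-size vector."""
    dims = len(next(iter(word_vectors.values())))
    return [sum(word_vectors[t][d] for t in tokens) / len(tokens) for d in range(dims)]

def average_predictions(model_probabilities):
    """Average per-answer probabilities from several models (simple ensemble)."""
    n_models = len(model_probabilities)
    n_classes = len(model_probabilities[0])
    return [sum(p[c] for p in model_probabilities) / n_models for c in range(n_classes)]

# Bag of words on a toy 2-d vocabulary.
word_vectors = {"what": [1.0, 0.0], "colour": [0.0, 1.0]}
question_vector = bag_of_words(["what", "colour"], word_vectors)

# Ensemble over hypothetical softmax outputs of three models.
answers = ["cat", "dog", "two"]
model_a = [0.5, 0.3, 0.2]
model_b = [0.2, 0.6, 0.2]
model_c = [0.3, 0.5, 0.2]
averaged = average_predictions([model_a, model_b, model_c])
best = answers[max(range(len(averaged)), key=averaged.__getitem__)]
print(question_vector, best)
```

Here `model_a` alone would answer "cat", but the averaged distribution prefers "dog".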
### Baseline

* Includes models where the answer is guessed, where only image features or only question features are used, and where image features are combined with prior knowledge of the object.
* Also includes a KNN model where the system finds the nearest (image, question) pair.

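A nearest-neighbour baseline of this kind might look like the sketch below. It assumes each (image, question) pair has already been turned into a single feature vector and uses cosine similarity as the distance; both choices, along with the toy vectors and the `nearest_answer` helper, are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_answer(query, training_set):
    """Return the answer attached to the training pair most similar to the query."""
    _, answer = max(training_set, key=lambda pair: cosine(query, pair[0]))
    return answer

# Each entry: (features of an (image, question) pair, its answer). Toy values.
training_set = [
    ([1.0, 0.0, 0.9, 0.1], "red"),
    ([0.0, 1.0, 0.1, 0.8], "two"),
]
print(nearest_answer([0.9, 0.1, 1.0, 0.0], training_set))  # red
```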
### Metrics

* Accuracy
* Wu-Palmer similarity measure

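Wu-Palmer similarity scores two concepts by how deep their closest shared ancestor sits in a taxonomy: `2 * depth(lcs) / (depth(a) + depth(b))`. The sketch below computes it on a tiny hand-made tree; the taxonomy is an assumption standing in for a real lexical database such as WordNet, and the root is given depth 1.

```python
# Toy taxonomy (assumption): child -> parent, with None marking the root.
PARENT = {
    "entity": None,
    "animal": "entity",
    "furniture": "entity",
    "cat": "animal",
    "dog": "animal",
    "chair": "furniture",
}

def path_to_root(node):
    """Return the chain of nodes from `node` up to the root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def wu_palmer(a, b):
    """Wu-Palmer similarity using the deepest common ancestor of a and b."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first node shared with a is the deepest common one.
    lcs = next(node for node in path_to_root(b) if node in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wu_palmer("cat", "dog"))    # shared ancestor "animal": 2*2 / (3+3)
print(wu_palmer("cat", "chair"))  # shared ancestor "entity": 2*1 / (3+3)
print(wu_palmer("cat", "cat"))    # identical concepts give 1.0
```

Unlike plain accuracy, this gives partial credit when a wrong answer is semantically close to the correct one.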
## Observations

* The VIS+LSTM model outperforms the baselines, while the FULL model benefits from averaging across all the models.
* Some useful information seems to be lost when downsizing the VGG vectors.
* Fine-tuning the word vectors helps with performance.
* Normalising the CNN hidden image features to zero mean and unit variance leads to faster training.
* The model does not perform well at reasoning about spatial relations between multiple objects or at counting when multiple objects are present.
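
The normalisation mentioned above can be sketched as follows. This assumes the statistics are computed per feature dimension across the whole set of images, which is one common reading and not something these notes confirm; `standardise` is a hypothetical helper.

```python
import math

def standardise(features):
    """Scale each feature dimension to zero mean and unit variance across examples."""
    n = len(features)
    dims = len(features[0])
    means = [sum(row[d] for row in features) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((row[d] - means[d]) ** 2 for row in features) / n)
        for d in range(dims)
    ]
    # Constant dimensions have std 0; divide by 1.0 there to avoid a ZeroDivisionError.
    return [[(row[d] - means[d]) / (stds[d] or 1.0) for d in range(dims)] for row in features]

# Three toy "images" with two feature dimensions on very different scales.
features = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = standardise(features)
print(scaled)
```

After scaling, both dimensions have the same spread even though the raw values differed by a factor of ten.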
43 changes: 43 additions & 0 deletions
...nsfer in Natural Language Generation Systems Using Recurrent Neural Networks.md
@@ -0,0 +1,43 @@
---
layout: post
title: Stylistic Transfer in Natural Language Generation Systems Using Recurrent Neural Networks
comments: True
excerpt: The paper explores the problem of style transfer in natural language generation.
tags: ['2016', 'ACL 2016', ACL, AI, NLG, NLP, Workshop]
---

## Introduction

* [This workshop paper](https://aclweb.org/anthology/W/W16/W16-6010.pdf) explores the problem of style transfer in natural language generation (NLG).
* One possible manifestation would be rewriting technical articles in an easy-to-understand manner.

## Challenges

* Identifying relevant stylistic cues and using them to control text generation in NLG systems.
* Absence of a large amount of training data.

## Pitch

* Use Recurrent Neural Networks (RNNs) to disentangle the style from the semantic content.
* Autoencoder model with two components - one for learning style and another for learning content.
* This allows the "style" component to be replaced while keeping the "content" component the same, resulting in a style transfer.
* One way to think about this: the encoder generates a 100-dimensional vector, in which the first 50 entries correspond to the "style" component and the remaining 50 to the "content" component.
* The proposal is that the loss function should be modified to include a cross-covariance term to ensure disentanglement.
* I think one way of doing this is to have two loss functions:
    * The **first loss** function ensures that the input sentence is decoded properly into the target sentence. This loss is computed for each sentence.
    * The **second loss** ensures that the first 50 entries across all the encoded representations are correlated. This loss operates at the batch level.
    * The **total loss** is the weighted sum of these 2 losses.

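The split-and-swap idea and a batch-level penalty can be sketched as below. Everything here is an assumption made for illustration: codes are 4-dimensional with 2 style entries instead of 100 and 50, the penalty is half the sum of squared cross-covariances between style and content dimensions (one common way to write a cross-covariance term, not necessarily the paper's), and the reconstruction losses are made-up numbers.

```python
STYLE_DIMS = 2  # toy size; the example in the notes uses 50 of 100 entries

def split(code):
    """Split an encoded vector into its (style, content) parts."""
    return code[:STYLE_DIMS], code[STYLE_DIMS:]

def swap_style(content_source, style_source):
    """Combine the style part of one code with the content part of another."""
    style, _ = split(style_source)
    _, content = split(content_source)
    return style + content

def cross_covariance_penalty(batch):
    """Half the sum of squared covariances between style and content dimensions."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(code[d] for code in batch) / n for d in range(dims)]
    penalty = 0.0
    for i in range(STYLE_DIMS):
        for j in range(STYLE_DIMS, dims):
            cov = sum((code[i] - means[i]) * (code[j] - means[j]) for code in batch) / n
            penalty += 0.5 * cov ** 2
    return penalty

def total_loss(reconstruction_losses, batch, weight=0.1):
    """Weighted sum: mean per-sentence loss plus the batch-level penalty."""
    mean_reconstruction = sum(reconstruction_losses) / len(reconstruction_losses)
    return mean_reconstruction + weight * cross_covariance_penalty(batch)

# Style and content move together here, so the penalty is non-zero.
entangled = [[1.0, 0.0, 1.0, 0.0], [-1.0, 0.0, -1.0, 0.0]]
# Style and content vary independently here, so the penalty vanishes.
disentangled = [[1.0, 0.0, 1.0, 0.0], [1.0, 0.0, -1.0, 0.0],
                [-1.0, 0.0, 1.0, 0.0], [-1.0, 0.0, -1.0, 0.0]]

print(swap_style([0.1, 0.2, 0.3, 0.4], [0.9, 0.8, 0.7, 0.6]))
print(cross_covariance_penalty(entangled), cross_covariance_penalty(disentangled))
print(total_loss([1.0, 3.0], entangled))
```

The penalty needs statistics over several encoded sentences, which is why it operates at the batch level while the reconstruction loss is computed per sentence.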
## Possible Datasets

* [Complete works of Shakespeare](http://norvig.com/ngrams/shakespeare.txt)
* [Wikipedia Kaggle dataset](https://www.kaggle.com/c/wikichallenge/data)
* [Oxford Text Archive](https://ota.ox.ac.uk/)
* Twitter data

## Possible Metrics

* Soundness - whether the generated text is entailed by the input sentence.
* Coherence - free of grammatical errors, proper word usage, etc.
* Effectiveness - how effective the style transfer was.
* Since some of the metrics are subjective, human evaluators also need to be employed.