Commit

Add toolformer paper

shagunsodhani committed Feb 12, 2023
1 parent 5bc0ac5 commit 8301c9e
Showing 200 changed files with 47,627 additions and 2 deletions.
4 changes: 4 additions & 0 deletions _site/404.html
@@ -0,0 +1,4 @@
<div class="page">
<h1 class="page-title">404: Page not found</h1>
<p class="lead">Sorry, we've misplaced that URL or it's pointing to something that doesn't exist. <a href="/">Head back home</a> to try finding it again.</p>
</div>
227 changes: 227 additions & 0 deletions _site/README.md

Large diffs are not rendered by default.

Binary file added _site/assets/BatchNormalization/eq1.png
Binary file added _site/assets/BatchNormalization/eq2.png
Binary file added _site/assets/HNN/equation1.png
Binary file added _site/assets/HNN/equation2.png
Binary file added _site/assets/RNTN/MVRNN.png
Binary file added _site/assets/RNTN/P1RNTN.png
Binary file added _site/assets/RNTN/P2RNTN.png
Binary file added _site/assets/RNTN/ParseTreeMVRNN.png
Binary file added _site/assets/RNTN/RNN.png
Binary file added _site/assets/RNTN/RNNModels.png
Binary file added _site/assets/Swish/plot.png
Binary file added _site/assets/topk/eq1.png
Binary file added _site/assets/topk/eq2.png
106 changes: 106 additions & 0 deletions _site/site/2017/04/27/VQA-Visual-Question-Answering.html
@@ -0,0 +1,106 @@
<h3 id="problem-statement">Problem Statement</h3>

<ul>
<li>
<p>Given an image and a free-form, open-ended, natural language question (about the image), produce the answer for the image.</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/1505.00468v6">Link to the paper</a></p>
</li>
</ul>

<h3 id="vqa-challenge-and-workshop"><a href="http://www.visualqa.org/">VQA Challenge and Workshop</a></h3>

<ul>
<li>The authors organise an annual challenge and workshop to discuss the state-of-the-art methods and best practices in this domain.</li>
<li>Interestingly, the second version is starting on 27th April 2017 (today).</li>
</ul>

<h3 id="benefits-over-tasks-like-image-captioning">Benefits over tasks like image captioning:</h3>

<ul>
<li>Simple, <em>n-gram</em> statistics based methods are not sufficient.</li>
  <li>Requires the system to combine different aspects of knowledge - object detection, activity recognition, commonsense reasoning, etc.</li>
<li>Since only short answers are expected, evaluation is easier.</li>
</ul>

<h3 id="dataset">Dataset</h3>

<ul>
  <li>Created a new dataset of 50,000 realistic abstract scenes.</li>
<li>Used AMT to crowdsource the task of collecting questions and answers for MS COCO dataset (&gt;200K images) and abstract images.</li>
<li>Three questions per image and ten answers per question (along with their confidence) were collected.</li>
<li>The entire dataset contains over 760K questions and 10M answers.</li>
<li>The authors also performed an exhaustive analysis of the dataset to establish its diversity and to explore how the content of these question-answers differ from that of standard image captioning datasets.</li>
</ul>

<h3 id="highlights-of-data-collection-methodology">Highlights of data collection methodology</h3>

<ul>
<li>Emphasis on questions that require an image, and not just common sense, to be answered correctly.</li>
<li>Workers were shown previous questions when writing new questions to increase diversity.</li>
<li>Answers collected from multiple users to account for discrepancies in answers by humans.</li>
<li>Two modalities supported:
<ul>
<li><strong>Open-ended</strong> - produce the answer</li>
      <li><strong>Multiple-choice</strong> - select from a set of 18 provided options (comprising popular, plausible, random and, of course, the correct answers)</li>
</ul>
</li>
</ul>

<h3 id="highlights-from-data-analysis">Highlights from data analysis</h3>

<ul>
<li>Most questions range from four to ten words while answers range from one to three words.</li>
<li>Around 40% questions are “yes/no” questions.</li>
<li>Significant (&gt;80%) inter-human agreement for answers.</li>
<li>The authors performed a study where human evaluators were asked to answer the questions without looking at the images.</li>
  <li>Further, they performed a study where evaluators were asked to label whether a question could be answered using common sense alone and which, in their view, was the youngest age group that could answer the question.</li>
<li>The idea was to establish that a sufficient number of questions in the dataset required more than just common sense to answer.</li>
</ul>

<h3 id="baseline-models">Baseline Models</h3>

<ul>
<li><strong>random</strong> selection</li>
  <li><strong>prior (“yes”)</strong> - always answer “yes”.</li>
  <li><strong>per Q-type prior</strong> - pick the most popular answer for each question type (a minimal sketch of this baseline follows this list).</li>
  <li><strong>nearest neighbor</strong> - find the k nearest neighbors for the given (image, question) pair.</li>
</ul>
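<p>A minimal sketch of the “per Q-type prior” baseline described above, in plain Python. The (question_type, answer) data format and the “yes” fallback for unseen question types are assumptions for illustration, not taken from the paper’s code.</p>

<pre><code># Per Q-type prior baseline: remember the most frequent training answer for each
# question type and always predict it at test time.
from collections import Counter, defaultdict

def fit_per_qtype_prior(train_examples):
    """train_examples: iterable of (question_type, answer) pairs (assumed format)."""
    counts = defaultdict(Counter)
    for q_type, answer in train_examples:
        counts[q_type][answer] += 1
    return {q_type: c.most_common(1)[0][0] for q_type, c in counts.items()}

def predict(prior, q_type):
    # Fall back to the global prior ("yes") for unseen question types.
    return prior.get(q_type, "yes")

prior = fit_per_qtype_prior([("what color", "red"), ("what color", "red"),
                             ("how many", "2"), ("is there", "yes")])
print(predict(prior, "what color"))  # red
print(predict(prior, "why"))         # yes (fallback)
</code></pre>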

<h3 id="methods">Methods</h3>

<ul>
<li>
    <p>Two-channel model (a vision channel and a language channel) followed by a softmax over the K = 1000 most frequent answers.</p>
</li>
<li><strong>Image Channel</strong>
<ul>
      <li><strong>I</strong> - Activations from the last hidden layer of VGGNet, giving a 4096-dim image embedding.</li>
      <li><strong>norm I</strong> - l2-normalized version of <strong>I</strong>.</li>
</ul>
</li>
<li><strong>Question Channel</strong>
<ul>
      <li><strong>BoW Q</strong> - Bag-of-Words representation of the question using the top 1000 words plus the top 10 first, second and third words of the questions.</li>
      <li><strong>LSTM Q</strong> - Each word is encoded into a 300-dim vector using a fully-connected layer + tanh non-linearity. These embeddings are fed to an LSTM to obtain a 1024-dim question embedding.</li>
      <li><strong>Deeper LSTM Q</strong> - Same as LSTM Q but uses two hidden layers to obtain a 2048-dim embedding.</li>
</ul>
</li>
<li><strong>Multi-Layer Perceptron (MLP)</strong> - Combine image and question embeddings to obtain a single embedding.
<ul>
<li><strong>BoW Q + I</strong> method - concatenate BoW Q and I embeddings.</li>
      <li><strong>LSTM Q + I, deeper LSTM Q + norm I</strong> methods - the image embedding is transformed to 1024-dim using an FC layer with tanh non-linearity, followed by element-wise multiplication of the image and question vectors.</li>
</ul>
</li>
  <li>The combined embedding is passed to an MLP - a fully-connected network with two hidden layers (1000 units each, tanh, dropout 0.5) followed by a softmax (a minimal sketch of this model follows this list).</li>
  <li>Trained with cross-entropy loss, keeping the VGGNet parameters frozen.</li>
</ul>
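<p>A rough PyTorch-style sketch of the “deeper LSTM Q + norm I” pipeline summarised above: the l2-normalised 4096-dim VGGNet feature is mapped to 1024-dim (FC + tanh), fused with the question embedding by element-wise multiplication, and passed through the two-hidden-layer MLP with a softmax over the K = 1000 most frequent answers. Layer sizes follow the summary; the class and variable names, the vocabulary size and the exact construction of the 2048-dim question embedding are assumptions.</p>

<pre><code>import torch
import torch.nn as nn
import torch.nn.functional as F

class DeeperLSTMQNormI(nn.Module):
    def __init__(self, vocab_size=10000, num_answers=1000):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, 300)   # 300-dim word vectors (tanh applied below)
        # Two-layer LSTM; concatenating the final hidden and cell states of both
        # layers gives a 2048-dim question embedding (assumed construction).
        self.lstm = nn.LSTM(300, 512, num_layers=2, batch_first=True)
        self.img_fc = nn.Linear(4096, 1024)
        self.q_fc = nn.Linear(2048, 1024)
        self.mlp = nn.Sequential(
            nn.Linear(1024, 1000), nn.Tanh(), nn.Dropout(0.5),
            nn.Linear(1000, 1000), nn.Tanh(), nn.Dropout(0.5),
            nn.Linear(1000, num_answers),
        )

    def forward(self, image_feat, question_tokens):
        img = F.normalize(image_feat, p=2, dim=1)        # "norm I"
        img = torch.tanh(self.img_fc(img))               # 4096-dim to 1024-dim
        words = torch.tanh(self.word_embed(question_tokens))
        _, (h, c) = self.lstm(words)
        q = torch.cat([h[0], h[1], c[0], c[1]], dim=1)   # 4 x 512 = 2048-dim
        q = torch.tanh(self.q_fc(q))                     # 2048-dim to 1024-dim
        fused = img * q                                  # element-wise multiplication
        return self.mlp(fused)                           # logits over the 1000 answers

# Usage: precomputed (frozen) VGGNet features go in; train with cross-entropy.
model = DeeperLSTMQNormI()
logits = model(torch.randn(8, 4096), torch.randint(0, 10000, (8, 14)))
loss = F.cross_entropy(logits, torch.randint(0, 1000, (8,)))
</code></pre>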

<h3 id="results">Results</h3>

<ul>
  <li>Deeper LSTM Q + norm I is the best model, with 58.16% accuracy on the open-ended task and 63.09% on multiple-choice, but it remains far behind human evaluators (&gt;80% and &gt;90% respectively).</li>
<li>The best model performs well for answers involving common visual objects but performs poorly for answers involving counts.</li>
  <li>The vision-only model performs even worse than the model that always answers “yes”.</li>
</ul>
@@ -0,0 +1,34 @@
<h3 id="problem-statement">Problem Statement</h3>

<ul>
<li>VQA Task: Given an image and a free-form, open-ended, natural language question (about the image), produce the answer for the image.</li>
  <li>The paper attempts to fine-tune the simple baseline of Bag-of-Words + Image features (iBOWIMG) to make it competitive with more sophisticated LSTM models.</li>
<li><a href="http://arxiv.org/pdf/1512.02167.pdf">Link to the paper</a></li>
</ul>

<h3 id="model">Model</h3>

<ul>
  <li>VQA is modelled as a classification task where the system learns to choose among the top k most popular answers.</li>
  <li><strong>Text Features</strong> - Convert the input question into one-hot vectors and then transform them into word vectors using a word embedding.</li>
  <li><strong>Image Features</strong> - Last-layer activations from GoogLeNet.</li>
  <li>The text features are concatenated with the image features and fed into a softmax layer (see the sketch after this list).</li>
  <li>Different learning rates and weight-clipping thresholds are used for the word-embedding layer and the softmax layer, with the learning rate of the embedding layer set much higher than that of the softmax layer.</li>
</ul>
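<p>A minimal PyTorch sketch of the iBOWIMG setup as summarised above: a learned bag-of-words question feature is concatenated with the GoogLeNet image feature and fed to a single softmax classifier, with a much larger learning rate on the embedding layer than on the softmax layer. The dimensions, learning-rate values and names are illustrative assumptions, not values from the paper.</p>

<pre><code>import torch
import torch.nn as nn

class IBOWIMG(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, img_dim=1024, num_answers=1000):
        super().__init__()
        # EmbeddingBag sums word vectors, i.e. a learned bag-of-words representation.
        self.bow = nn.EmbeddingBag(vocab_size, embed_dim, mode="sum")
        self.classifier = nn.Linear(embed_dim + img_dim, num_answers)

    def forward(self, question_tokens, image_feat):
        q = self.bow(question_tokens)              # (B, embed_dim) text feature
        x = torch.cat([q, image_feat], dim=1)      # concatenate text and image features
        return self.classifier(x)                  # softmax applied inside the loss

model = IBOWIMG()
# Separate (illustrative) learning rates: embedding layer much higher than softmax layer.
optimizer = torch.optim.SGD([
    {"params": model.bow.parameters(), "lr": 0.8},
    {"params": model.classifier.parameters(), "lr": 0.01},
], momentum=0.9)
# Per-group weight clipping would additionally be applied after each update.
</code></pre>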

<h3 id="results">Results</h3>

<ul>
  <li>The iBOWIMG model reports an accuracy of 55.89% for Open-ended questions and 61.97% for Multiple-Choice questions, which is comparable to the performance of other, more sophisticated models.</li>
</ul>

<h3 id="interpretation-of-the-model">Interpretation of the model</h3>

<ul>
  <li>Since the model is so simple, it is possible to interpret what exactly the model is learning - this interpretability is the paper’s greatest strength, even though the model itself is simple and naive.</li>
  <li>The model attempts to memorise the correlation between the answer class and the informative words (in the question) and image features.</li>
  <li>The question words alone can often strongly influence the answer, given the bias in the images occurring in the COCO dataset.</li>
  <li>Given the simple linear transformation being used, it is possible to quantify the importance of each single word (in the question) to the answer.</li>
  <li>The paper uses the Class Activation Mapping (CAM) approach (which uses the linear relation between the softmax and the final image feature map) to highlight the informative image regions relevant to the predicted answer.</li>
  <li>While the results reported by the paper are not, by themselves, very significant, the described approach provides a way to interpret the strengths and weaknesses of different VQA datasets.</li>
</ul>
103 changes: 103 additions & 0 deletions _site/site/2017/05/07/Conditional-Similarity-Networks.html
@@ -0,0 +1,103 @@
<h2 id="problem-statement">Problem Statement</h2>

<ul>
  <li>A common way of measuring image similarity is to embed images into a feature space where distance acts as a proxy for similarity.</li>
  <li>But such a feature space can capture only one (or a weighted combination) of the many possible notions of similarity.</li>
  <li>What if contradicting notions of similarity could be captured at the same time - in terms of semantically distinct subspaces?</li>
  <li>The paper proposes a new architecture called Conditional Similarity Networks (CSNs) which learns a disentangled embedding such that the features for different notions of similarity are encoded into separate dimensions.</li>
  <li>It jointly learns masks (or feature extractors) that select and reweight the relevant dimensions to induce a subspace that encodes a specific notion of similarity.</li>
<li><a href="https://vision.cornell.edu/se3/conditional-similarity-networks/">Link to the paper</a></li>
</ul>

<h2 id="conditional-similarity-networks">Conditional Similarity Networks</h2>

<ul>
  <li>Given an image <em>x</em>, learn a non-linear feature embedding <em>f(x)</em> such that for any 2 images <em>x<sub>1</sub></em> and <em>x<sub>2</sub></em>, the Euclidean distance between <em>f(x<sub>1</sub>)</em> and <em>f(x<sub>2</sub>)</em> reflects their similarity.</li>
</ul>

<h3 id="conditional-similarity-triplets">Conditional Similarity Triplets</h3>

<ul>
  <li>Given a triplet of images <em>(x<sub>1</sub>, x<sub>2</sub>, x<sub>3</sub>)</em> and a condition <em>c</em> (the notion of similarity), an oracle (say, the crowd) is used to determine whether <em>x<sub>1</sub></em> is more similar to <em>x<sub>2</sub></em> or to <em>x<sub>3</sub></em> as per the given criterion <em>c</em>.</li>
  <li>In general, for images <em>i, j, l</em>, the triplet <em>t</em> is ordered {i, j, l | c} if <em>i</em> is more similar to <em>j</em> than to <em>l</em> under condition <em>c</em>.</li>
</ul>

<h3 id="learning-from-triplets">Learning From Triplets</h3>

<ul>
<li>Define a loss function <em>L<sub>T</sub>()</em> to model the similarity structure over the triplets.</li>
  <li><em>L<sub>T</sub>(i, j, l) = max{0, D(i, j) - D(i, l) + h}</em> where <em>D</em> is the Euclidean distance function and <em>h</em> is the scalar similarity margin that prevents trivial solutions.</li>
  <li>To model conditional similarities, masks <em>m</em> are defined as <em>m = σ(β)</em> where σ is the ReLU non-linearity and β is a set of parameters to be learnt.</li>
  <li><em>m<sub>c</sub></em> denotes the c-th column of the mask matrix. It acts as an element-wise gating function which selects the relevant dimensions of the embedding to attend to a particular similarity concept.</li>
  <li>The distance function <em>D</em> now computes the masked distance between the two images, i.e. the Euclidean distance between <em>f(i) ⊙ m<sub>c</sub></em> and <em>f(j) ⊙ m<sub>c</sub></em> (a sketch of this loss follows this list).</li>
  <li>Two regularising terms are also added - an L2 norm (keeping <em>D</em> bounded) and an L1 norm on <em>m</em> (encouraging sparse masks).</li>
</ul>
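<p>A minimal sketch of the conditional triplet loss described above: the embedding is gated by the condition-specific mask m<sub>c</sub> before computing Euclidean distances, the hinge with margin h is applied to each ordered triplet {i, j, l | c}, and the L2/L1 regularisers are added. The margin, regularisation weights and tensor names are assumptions for illustration.</p>

<pre><code>import torch
import torch.nn.functional as F

def masked_distance(fi, fj, mask):
    # Euclidean distance in the subspace selected / re-weighted by the mask.
    return torch.norm(fi * mask - fj * mask, p=2, dim=1)

def csn_triplet_loss(f_i, f_j, f_l, beta, condition, margin=0.2,
                     embed_reg=5e-3, mask_reg=5e-4):
    mask = F.relu(beta)[condition]                 # m_c = ReLU(beta), mask for condition c
    d_ij = masked_distance(f_i, f_j, mask)         # distance to the more similar image
    d_il = masked_distance(f_i, f_l, mask)         # distance to the less similar image
    triplet = F.relu(d_ij - d_il + margin).mean()  # max{0, D(i, j) - D(i, l) + h}
    # L2 penalty on the embeddings (keeps distances bounded) and L1 penalty on the mask (sparsity).
    reg = embed_reg * f_i.pow(2).sum(dim=1).mean() + mask_reg * mask.abs().sum()
    return triplet + reg

# Usage with random embeddings: 8 conditions over a 64-dim embedding.
beta = torch.randn(8, 64, requires_grad=True)
embed = lambda n: torch.randn(n, 64, requires_grad=True)
loss = csn_triplet_loss(embed(16), embed(16), embed(16), beta, condition=3)
loss.backward()
</code></pre>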

<h2 id="experiments">Experiments</h2>

<h3 id="datasets">Datasets</h3>

<ul>
<li>Fonts dataset by Bernhardsson
<ul>
      <li>3.1 million 64×64-pixel greyscale images.</li>
</ul>
</li>
<li>Zappos50k shoe dataset
<ul>
      <li>Contains 50,000 richly annotated images of individual shoes.</li>
<li>Characteristics of interest:
<ul>
<li>Type of the shoes (i.e., shoes, boots, sandals or slippers)</li>
<li>Suggested gender of the shoes (i.e., for women, men, girls or boys)</li>
<li>Height of the shoes’ heels (0 to 5 inches)</li>
<li>Closing mechanism of the shoes (buckle, pull on, slip on, hook and loop or laced up)</li>
</ul>
</li>
</ul>
</li>
</ul>

<h3 id="models">Models</h3>

<ul>
  <li>The initial model for the experiments is a ConvNet pre-trained on ImageNet.</li>
<li><strong>Standard Triplet Network</strong>
<ul>
<li>Learn from all available triplets jointly as if they have the same notion of similarity.</li>
</ul>
</li>
<li><strong>Set of Task Specific Triplet Networks</strong>
<ul>
<li>Train n separate triplet networks such that each is trained on a single notion of similarity.</li>
<li>Needs far more parameters and compute.</li>
</ul>
</li>
<li><strong>Conditional Similarity Networks - fixed disjoint masks</strong>
<ul>
      <li>In this version, only the convolutional filters and the embedding are learnt; the masks are predefined to be disjoint (a sketch of both mask variants follows this list).</li>
<li>Aims to learn a fully disjoint embedding.</li>
</ul>
</li>
<li><strong>Conditional Similarity Networks - learned masks</strong>
<ul>
<li>Learns all the components - conv filters, embedding and the masks.</li>
</ul>
</li>
  <li>Refer to the paper for details on the hyperparameters.</li>
</ul>
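<p>One plausible way to set up the two CSN mask variants compared above, sketched in PyTorch: fixed disjoint masks partition the embedding dimensions into equal blocks, one block per condition, while learned masks are free parameters β trained jointly with the network and passed through a ReLU. The equal-block partition, sizes and initialisation are assumptions, not the paper’s implementation.</p>

<pre><code>import torch

def fixed_disjoint_masks(num_conditions=4, embed_dim=64):
    # Each condition is assigned its own contiguous block of embedding dimensions.
    masks = torch.zeros(num_conditions, embed_dim)
    block = embed_dim // num_conditions
    for c in range(num_conditions):
        masks[c, c * block:(c + 1) * block] = 1.0
    return masks  # fixed, not trained

def learned_mask_params(num_conditions=4, embed_dim=64):
    # beta is trained jointly with the conv filters and the embedding;
    # the ReLU (applied when the mask is used) keeps the masks non-negative.
    return torch.nn.Parameter(torch.randn(num_conditions, embed_dim))

print(fixed_disjoint_masks(4, 8))  # block-diagonal pattern over 8 dimensions
</code></pre>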

<h2 id="results">Results</h2>

<ul>
  <li>Visual exploration of the learned subspaces (t-SNE visualisation) shows that the network successfully disentangles different features in the embedded vector space.</li>
  <li>The learned masks are very sparse and share dimensions. This shows that CSNs may learn to use only the required number of dimensions, thereby doing away with the need to pick the right embedding size.</li>
<li>Order of performance:
<ul>
<li>CSNs with learned masks &gt; CSNs with fixed masks &gt; Task-specific networks &gt; standard triplet network.</li>
<li>Though CSNs with learned masks require more training data.</li>
</ul>
</li>
  <li>CSNs also outperform the Standard Triplet Network when used as off-the-shelf features for a (brand) classification task and come very close to the performance of a ResNet trained on ImageNet.</li>
  <li>This shows that while CSNs retain most of the information in the original network, the training mechanism of the Standard Triplet Network hurts the underlying conv features and their generalising capability.</li>
</ul>
@@ -0,0 +1,38 @@
<h3 id="problem-statement">Problem Statement</h3>

<ul>
<li>Standard VQA models benefit from the inherent bias in the structure of the world and the language of the question.</li>
<li>For example, if the question starts with “Do you see a …”, it is more likely to be “yes” than “no”.</li>
<li>To truly assess the capability of any VQA system, we need to have evaluation tasks that require the use of both the visual and the language modality.</li>
<li>The authors present a balanced version of <a href="https://shagunsodhani.in/papers-I-read/VQA-Visual-Question-Answering">VQA dataset</a> where each question in the dataset is associated with a pair of similar images such that the same question would give different answers on the two images.</li>
  <li>The proposed data collection procedure enables the authors to develop a novel, interpretable model which, given an image and a question, identifies an image that is similar to the original image but has a different answer to the same question, thereby building trust in the system.</li>
<li><a href="https://arxiv.org/abs/1612.00837">Link to the paper</a></li>
</ul>

<h3 id="dataset-collection">Dataset Collection</h3>

<ul>
<li>Given an (image, question, answer) triplet (I, Q, A) from the VQA dataset, a human worker (on AMT) is asked to identify an image I’ which is similar to I but for which the answer to question Q is A’ (different from A).</li>
  <li>To facilitate the search for I’, the worker is shown the 24 nearest-neighbor images of I (based on VGGNet features) and is asked to choose the image most similar to I for which Q makes sense and the answer to Q is different from A. In case none of the 24 images qualifies, the worker may select “not possible”.</li>
<li>In the second round, the workers were asked to answer Q for I’.</li>
  <li>This 2-stage protocol results in a significantly more balanced dataset than the original VQA dataset.</li>
</ul>

<h3 id="observation">Observation</h3>

<ul>
<li>State-of-the-art models trained on unbalanced VQA dataset perform significantly worse on the new, balanced dataset indicating that those models benefitted from the language bias in the older dataset.</li>
<li>Training on balanced dataset improves performance on the unbalanced dataset.</li>
<li>Further, the VQA model, trained on the balanced dataset, learns to differentiate between otherwise similar images.</li>
</ul>

<h3 id="counter-example-explanations">Counter-example Explanations</h3>

<ul>
  <li>Given an image and a question, the model not only answers the question but also provides an image (from the k nearest neighbours of I, based on VGGNet features) which is similar to the input image but for which the model would have given a different answer to the same question.</li>
  <li>The supervising signal is provided by the data collection procedure, where humans pick the image I’ from the same set of candidate images.</li>
  <li>For each image in the candidate set, compute the inner product of the question-image embedding and the answer embedding.</li>
  <li>The K inner-product values are passed through a fully connected layer to generate K scores (a sketch of this scoring head follows this list).</li>
  <li>The model is trained with a pairwise hinge ranking loss so that the score of the human-picked image is higher than the score of all other images by a margin of M (a hyperparameter).</li>
  <li>The proposed explanation model achieves a recall@5 of 43.49%.</li>
</ul>
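<p>A rough sketch of the counter-example scoring head described above: for each of the K candidate images, the inner product of its question-image embedding with the answer embedding is computed, the K values are mapped through a fully connected layer to K scores, and training uses a pairwise hinge ranking loss against the human-picked candidate. The tensor shapes and the margin value are assumptions.</p>

<pre><code>import torch
import torch.nn as nn
import torch.nn.functional as F

class CounterExampleScorer(nn.Module):
    def __init__(self, num_candidates=24):
        super().__init__()
        self.fc = nn.Linear(num_candidates, num_candidates)  # K inner products to K scores

    def forward(self, qi_embed, ans_embed):
        # qi_embed: (B, K, D) question-image embeddings of the K candidate images
        # ans_embed: (B, D) answer embedding
        dots = torch.einsum("bkd,bd->bk", qi_embed, ans_embed)  # K inner products
        return self.fc(dots)                                    # K scores

def pairwise_hinge_loss(scores, picked_idx, margin=1.0):
    # Score of the human-picked image should exceed every other score by the margin M.
    picked = scores.gather(1, picked_idx.unsqueeze(1))          # (B, 1)
    hinge = F.relu(margin + scores - picked)                    # per-candidate violations
    ignore = F.one_hot(picked_idx, scores.size(1)).bool()       # do not penalise the picked slot
    return hinge.masked_fill(ignore, 0.0).mean()

scorer = CounterExampleScorer()
scores = scorer(torch.randn(4, 24, 512), torch.randn(4, 512))
loss = pairwise_hinge_loss(scores, torch.randint(0, 24, (4,)))
</code></pre>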