Given an image and a free-form, open-ended, natural-language question (about the image), produce the answer for the image.
2-channel model (using vision and language models) followed by a softmax over the (K = 1000) most frequent answers.
Given two sentences a and b, the model has to predict whether they have an “entailment”, “neutral” or “contradiction” relationship.
The current trend in deep learning is to design, train and fine-tune a separate model for each problem.
Though multi-task models have been explored, they have been trained on problems from the same domain only, and no competitive multi-task, multi-modal models have been proposed.
The paper explores the possibility of such a unified deep learning model that can solve different tasks across multiple domains by training concurrently on them.
Small, modality-specific subnetworks (called modality nets) should be used to map input data to a joint representation space and back.
The joint representation is to be of variable size.
Different tasks from the same domain share the modality net.
MultiModel networks should use computational blocks from different domains even if they are not specifically designed for the task at hand.
The MultiModel network consists of a few small modality nets, an encoder, an I/O mixer and an autoregressive decoder.
Encoder and decoder use the following computational blocks:
Convolutional Block
Attention Block
Mixture-of-Experts (MoE) Block
For further details, refer to the original paper.
The encoder consists of 6 conv blocks with a MoE block in the middle.
The I/O mixer consists of an attention block and 2 conv blocks.
The decoder consists of 4 blocks of convolution and attention, with a MoE block in the middle.
Modality Nets
Language Data
Input is the sequence of tokens ending in a termination token.
This sequence is mapped to the correct dimensionality using a learned embedding.
For output, the network takes the decoded output and performs a learned linear mapping followed by a softmax.
Image and Categorical Data
Uses residual convolution blocks.
Similar to the exit flow of the Xception network.
Audio Data
WSJ speech corpus
ImageNet dataset
COCO image captioning dataset
WSJ parsing dataset
WMT English-German translation corpus
German-English translation
WMT English-French translation corpus
German-French translation
The experimental section is not very rigorous, with many details skipped (these would probably be added later).
While MultiModel does not beat the state-of-the-art models, it does outperform some recent models.
The jointly trained model performs similarly to single-task models on tasks with a lot of data and sometimes outperforms single-task models on tasks with less data (like parsing).
Interestingly, jointly training the model on the parsing task and the ImageNet task improves the performance of the parsing task, even though the two tasks are seemingly unrelated.
Another experiment was done to evaluate the effect of components (like MoE) on tasks (like ImageNet) which do not explicitly need them. It was observed that the performance either went down or remained the same when the MoE component was removed. This indicates that mixing different components does help to improve performance over multiple tasks.
But this observation is not conclusive, as a different combination of, say, the encoder (that does not use MoE) could achieve better performance than one that does. The paper does not explore such possibilities.
Dynamic Memory Network (DMN) is a neural-network-based general framework that can be used for tasks like sequence tagging, classification, sequence-to-sequence, and question answering requiring transitive reasoning.
The basic idea is that all these tasks can be modelled as a question answering task in general, and a common architecture could be used for solving them.
Concatenate all the sentences (or facts) in the document and encode them by feeding the word embeddings of the text to a GRU.
Each time a sentence ends, extract the hidden representation of the GRU up to that point and use it as the encoded representation of the sentence.
The episodic memory consists of an attention mechanism and a recurrent network with which it updates its memory.
During each iteration, the network generates an episode e by attending over the representations of the sentences, the question and the previous memory.
The episodic memory is updated using the current episode and the previous memory.
Depending on the amount of supervision available, the network may perform multiple passes. E.g., in the bAbI dataset, some tasks specify how many passes would be needed and which sentence should be attended to in each pass. For the others, a fixed number of passes are made.
Multiple passes allow the network to perform transitive inference.
Given the input representation c, memory m and question q, produce a scalar score using a 2-layer feedforward network, to use as the attention mechanism (a sketch follows below).
A separate GRU encodes the input representation, with its updates weighted by the attention.
The final state of the GRU is fed to the answer module.
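A minimal sketch of such an attention gate, assuming a feature vector built from simple elementwise interactions between the sentence representations c, the question q and the memory m (the paper's feature set is richer; names are illustrative):

```python
import torch
import torch.nn as nn

class EpisodicAttention(nn.Module):
    """Scores each sentence representation against the question and the
    current memory with a 2-layer feedforward network (DMN-style gate)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(4 * hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, c, q, m):
        # c: (num_sentences, hidden_dim); q, m: (hidden_dim,)
        q, m = q.expand_as(c), m.expand_as(c)
        # Assumed interaction features; the paper uses a larger set.
        z = torch.cat([c * q, c * m, (c - q).abs(), (c - m).abs()], dim=-1)
        return torch.softmax(self.scorer(z).squeeze(-1), dim=0)
```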
bAbI Dataset
For most tasks, DMN either outperforms Memory Networks or performs as well as them.
For tasks like answering with 2 or 3 supporting facts, DMN lags because of the limitation of RNNs in modelling long sentences.
Stanford Sentiment Treebank Dataset
DMN outperforms all the baselines for both binary and fine-grained sentiment analysis.
Wall Street Journal Dataset
DMN achieves a state-of-the-art accuracy of 97.56%.
Multiple passes help in reasoning tasks but not so much for sentiment/POS tagging.
Attention in the case of the 2-iteration DMN is more focused than attention in the 1-iteration DMN.
For the 2-iteration DMN, attention in the second iteration focuses only on relevant words, and less attention is paid to words that lose their relevance in the context of the entire document.
It would be interesting to put some mechanism in place to determine the number of episodes that should be generated before an answer is predicted. A naive way would be to predict the answer after each episode and check if the softmax score of the predicted answer is above a threshold.
Alternatively, the softmax score and other information could be fed to a Reinforcement Learning (RL) agent which decides if the document should be read again. So every time an episode is generated, the state is passed to the RL agent, which decides if another iteration should be performed. If it decides to predict the answer and the correct answer is generated, the agent gets a large positive reward, else a large negative reward.
To discourage unnecessary iterations, a small negative reward could be given every time the agent decides to perform another iteration.
Given a pre-trained neural network, which is trained using data from some distribution P (referred to as in-distribution data), the task is to detect examples coming from a distribution Q which is different from P (referred to as out-of-distribution data).
For example, if a digit-recognizer neural network is trained using MNIST images, out-of-distribution examples would be images of animals.
Neural networks can make high-confidence predictions even in such cases where the input is unrecognisable or irrelevant.
The paper proposes ODIN, which can detect such out-of-distribution examples without changing the pre-trained model itself.
Uses 2 major techniques:
Temperature scaling - the softmax classifier for the classification network can be written as:
pi(x, T) = exp(fi(x) / T) / sumj exp(fj(x) / T)
where x is the input, pi is the softmax probability and T is the temperature scaling parameter.
Input perturbation - add small perturbations to the input (image) before feeding it into the network:
x_perturbed = x - ε * sign(-∇x log p_ŷ(x, T))
where ε is the perturbation magnitude (a sketch combining both techniques follows below).
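A minimal sketch combining both techniques, assuming a pre-trained classifier `model`; the temperature and perturbation values are illustrative defaults, not tuned settings:

```python
import torch
import torch.nn.functional as F

def odin_score(model, x, temperature=1000.0, epsilon=0.0012):
    """Returns the max temperature-scaled softmax score of the perturbed
    input; low scores suggest an out-of-distribution example."""
    x = x.detach().clone().requires_grad_(True)
    log_probs = F.log_softmax(model(x) / temperature, dim=-1)
    # Gradient of the log-probability of the predicted class w.r.t. x.
    log_probs.max(dim=-1).values.sum().backward()
    # Perturb in the direction that increases the softmax score.
    x_perturbed = x - epsilon * torch.sign(-x.grad)
    with torch.no_grad():
        probs = F.softmax(model(x_perturbed) / temperature, dim=-1)
    return probs.max(dim=-1).values
```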
Code is available on GitHub.
Models
In-Distribution Datasets
Out-of-Distribution Datasets
Metrics
ODIN outperforms the baseline across all datasets and all models by a good margin.
In the domain of machine comprehension, making multiple passes over the given document is an effective technique to extract the relation between the given passage, question and answer.
Unlike previous approaches, which perform a fixed number of passes over the passage, the Reasoning Network (ReasoNet) uses reinforcement learning (RL) to decide how many times a document should be read.
Every time the document is read, ReasoNet determines whether the document should be read again or the termination state has been reached. If the termination state is reached, the answer module is triggered to generate the answer.
Since the termination state is discrete and not connected to the final output, an RL approach is used.
CNN, DailyMail Dataset
SQuAD
Graph Reachability Dataset
Memory (M) - Comprises the vector representations of the document and the question (encoded using a GRU or other RNNs).
Attention - The attention vector (xt) is a function of the current internal state st and the external memory M. The state and memory are passed through FC layers and fed to a similarity function.
Internal State (st) - Vector representation of the question state, computed by an RNN using the previous internal state and the attention vector xt.
Termination Gate (Tt) - Uses a logistic regression model to generate a random binary variable using the current internal state st.
Reinforcement Learning - For the RL setting, the reward at time t is rt = 1 if Tt = 1 and the answer is correct; otherwise rt = 0.
Workflow - Given a passage p, query q and answer a:
Extract the memory using p.
Extract the initial hidden state using q.
ReasoNet executes all possible episodes that can be enumerated by setting an upper limit on the number of passes.
These episodes generate actions and answers that are used to train the ReasoNet.
Result
CNN, DailyMail Corpus
SQuAD
Graph Reachability Dataset
ReasoNet - Standard ReasoNet as described above.
ReasoNet-Last - Uses the prediction from the final step, Tmax.
ReasoNet > ReasoNet-Last > Deep LSTM Reader
ReasoNet converges faster than ReasoNet-Last, indicating that the termination gate is useful.
Notes
R-NET is an end-to-end trained neural network model for machine comprehension.
It starts by matching the question and the given passage (using a gated attention-based RNN) to obtain a question-aware passage representation.
Next, it uses a self-matching attention mechanism to refine the passage representation by matching the passage against itself.
Lastly, it uses pointer networks to determine the positions of the answer in the passage.
SQuAD
MS-MARCO
Question / Passage Encoder
Gated Attention-based RNN
Given the question and passage representations, a sentence-pair representation is generated via soft alignment of the words in the question and in the passage.
The newly added gate captures the relation between the question and the current passage word, as only some parts of the passage are relevant for answering the given question.
Self-Matching Attention
The passage representation obtained so far would not capture most of the context.
So the current representation is matched against itself, so as to collect evidence from the entire passage and encode the evidence relevant to the current passage word and question.
Output Layer
Use a pointer network (initialized using attention pooling over the question representation) to predict the position of the answer.
The loss function is the sum of the negative log probabilities of the start and end positions.
Results
R-NET is ranked second on the SQuAD leaderboard as of 7th August 2017 and achieves the best published results on the MS-MARCO dataset.
Ideas like sentence ranking, using syntax information, performing multi-hop inference and augmenting the question dataset (using a seq2seq network) did not help in improving the performance.
Word-based language models suffer from the problem of rare or Out-of-Vocabulary (OOV) words.
Learning representations for OOV words directly on the end task often results in poor representations.
The alternative is to replace all the rare words with a single, unique representation (loss of information) or to use character-level models to obtain word representations (these tend to miss the semantic relationships).
The paper proposes to learn a network that can predict the representations of words using auxiliary data (referred to as definitions), such as dictionary definitions, Wikipedia infoboxes, the spelling of the word, etc.
The auxiliary data encoders are trained jointly with the end task to ensure that the word representations align with the requirements of the end task.
Given a rare word w, let d(w) = <x1, x2, …> denote its definition, where the xi are words.
d(w) is fed to a definition reader network f (an LSTM) and its last state is used as the definition embedding ed(w) (a sketch follows below).
In case w has multiple definitions, the embeddings are combined using mean pooling.
The approach can be extended to in-vocabulary words as well, by using the definition embeddings of such words to update their original embeddings.
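A minimal sketch of such a definition reader, with mean pooling over multiple definitions (class and argument names are illustrative):

```python
import torch
import torch.nn as nn

class DefinitionReader(nn.Module):
    """Runs an LSTM over the words of each definition of a rare word and
    uses the last hidden state as the definition embedding; multiple
    definitions are combined with mean pooling."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, definitions):
        # definitions: list of 1-D LongTensors, one token sequence per definition.
        embeddings = []
        for d in definitions:
            _, (h, _) = self.lstm(self.embed(d).unsqueeze(0))
            embeddings.append(h.squeeze())  # last hidden state = e_d(w)
        return torch.stack(embeddings).mean(dim=0)
```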
The proposed approach was tested on the following tasks:
For all the tasks, models using both spelling and dictionary (SD) outperformed models using just one.
Multi-token words like “San Francisco” are not accounted for as of now.
The model does not handle rare words which appear in the definitions themselves and just replaces them with a generic unknown-word token.
The paper introduces a novel architecture that generates an output sequence whose elements are discrete tokens corresponding to positions in the input sequence.
Such a problem cannot be solved using Seq2Seq models or Neural Turing Machines, as the size of the output softmax is variable (it depends on the size of the input sequence).
Traditional attention-based sequence-to-sequence models compute an attention vector for each step of the output decoder and use it to blend the encoder states of the input into a single, consolidated context vector, which is then used to compute a fixed-size softmax.
In Pointer Nets, the attention vector (over all the tokens in the input sequence) is normalized and treated directly as the softmax output over the input tokens.
So the Pointer Net is a very simple modification of the attention model (a sketch follows below).
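A minimal sketch of this modification, using additive (Bahdanau-style) attention scores directly as the output distribution over input positions:

```python
import torch
import torch.nn as nn

class PointerAttention(nn.Module):
    """The softmax over attention scores *is* the model output: a
    distribution over positions of the input sequence."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_enc = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W_dec = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, encoder_states, decoder_state):
        # encoder_states: (seq_len, hidden_dim); decoder_state: (hidden_dim,)
        scores = self.v(torch.tanh(self.W_enc(encoder_states) +
                                   self.W_dec(decoder_state))).squeeze(-1)
        # No blending into a context vector: normalize and return directly.
        return torch.softmax(scores, dim=0)  # (seq_len,)
```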
It applies to any problem where the size of the output depends on the size of the input, which rules out a fixed-length softmax.
E.g., combinatorial problems such as the planar convex hull, where the size of the output depends on the size of the input.
The paper considers the following 3 problems:
Since some of the problems are NP-hard, the paper considers approximate solutions wherever the exact solutions are not feasible to compute.
The authors used the exact same architecture and model parameters for all the instances of the 3 problems, to show the generality of the model.
The proposed Pointer Nets outperform LSTMs and LSTMs with attention, and can generalise quite well to much larger sequences.
Interestingly, the order in which the inputs are fed to the system affects its performance. The authors discussed this aspect in their subsequent paper titled Order Matters: Sequence To Sequence for Sets.
Raw - The original query is fed to the search engine without any modification.
Pseudo-Relevance Feedback (PRF-TFIDF) - The query is expanded using the top-N TF-IDF terms.
PRF-Relevance Model (PRF-RM) - The probability of adding token t to the query q0 is given by P(t|q0) = (1 − λ)P′(t|q0) + λ Σd P(d)P(t|d)P(q0|d).
Reward - The paper uses Recall@K as the reward when training the RL-based models, with the argument that the “metric has shown to be effective in improving the other metrics as well”, without any justification though.
SL-Oracle - A classifier that perfectly selects terms that will increase performance, based on the supervised learning approach.
The paper presents a new machine comprehension dataset for question answering in a real-life setting (say, when interacting with Cortana/Siri).
Existing machine comprehension (MC) datasets are either too small or synthetic (with a distribution different from that of real questions posted by humans). MARCO questions are sampled from real, anonymized user queries.
Most datasets provide a comparatively small and clean context to answer the question. In MARCO, the context documents (which may or may not contain the answer) are extracted using Bing from real-world documents. As such, the questions and the context documents are noisy.
In general, the answers to the questions are restricted to an entity or a text span within the document. In the case of MARCO, the human judges are encouraged to generate complete sentences as answers.
The first release consists of 100K questions, with the aim of releasing 1M questions in future releases.
All questions are tagged with segment information.
A subset of questions has multiple answers and another subset has no answers at all.
Each record in the dataset contains the following information:
Metrics
Among generative models, Memory Networks performed better than seq2seq.
In the cloze-style test, ReasoNet achieved an accuracy of approx. 59% while the Attention Sum Reader achieved an accuracy of approx. 55%.
Current QA systems (including the ones using memory and attention) derive their power from supervised data and are very different from how humans do reasoning.
The ImageNet dataset pushed the state-of-the-art performance on object classification to beyond human accuracy. Similar was the case with the speech recognition dataset from DARPA, which led to the advancement of speech recognition. Having a large, diverse dataset of human-like questions is a fundamental requirement for advancing the field, and the paper aims to provide just the right kind of dataset.
The paper presents a new activation function called Swish, with formulation f(x) = x.sigmoid(x), and its parameterised version called Swish-β, where f(x, β) = x.sigmoid(β.x) and β is a trainable parameter (a sketch follows below).
The paper shows that Swish is consistently able to outperform ReLU and other activation functions over a variety of datasets (CIFAR, ImageNet, WMT2014), though only by small margins in some cases.
Smooth, non-monotonic function.
Swish-β can be thought of as a smooth function that interpolates between a (scaled) linear function and ReLU.
Uses a self-gating mechanism (that is, it uses its own value to gate itself). Gating generally uses multiple scalar inputs, but since self-gating uses a single scalar input, it can be used to replace activation functions, which are generally pointwise.
Being unbounded on the x > 0 side, it avoids the saturation (near-zero gradients) that slows down training.
Being bounded below induces a kind of regularization effect, as large negative inputs are forgotten.
Since the Swish function is smooth, the output landscape and the loss landscape are also smooth. A smooth landscape should be more traversable and less sensitive to initialization and learning rates.
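A minimal sketch of Swish-β:

```python
import torch

def swish(x, beta=1.0):
    """Swish-β: x * sigmoid(beta * x). beta can also be made a learnable
    parameter; beta -> 0 recovers a scaled linear function (x / 2) and
    beta -> infinity approaches ReLU."""
    return x * torch.sigmoid(beta * x)
```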
HARP is an architecture to learn low-dimensional node embeddings by compressing the input graph into smaller graphs.
Given a graph G = (V, E), compute a series of successively smaller (coarser) graphs G0, …, GL. Learn the node representations on GL and successively refine the embeddings for the larger graphs in the series.
The architecture is independent of the algorithms used to embed the nodes or to refine the node representations.
Graph coarsening technique that preserves global structure:
Collapse edges and stars to preserve first- and second-order proximity.
Edge collapsing - select a subset of E such that no two edges are incident on the same vertex, and merge the endpoints of each selected edge into a single node, merging their edges as well.
Star collapsing - given a star structure, collapse pairs of neighbouring nodes (of the central node).
In practice, first apply star collapsing, followed by edge collapsing.
Extending node representations from the coarser graph to the finer graph:
Let's say node1 and node2 were merged into node12 during coarsening. First copy the representation of node12 into node1 and node2.
Additionally, if hierarchical softmax was used, extend the binary tree such that node12 is replaced by 2 child nodes, node1 and node2.
The time complexity of HARP + DeepWalk is O(number of walks * |V|), while that of HARP + LINE is O(number of iterations * |E|).
The asymptotic complexity remains the same as the HARP-less version in the two cases.
The multi-label classification task shows that HARP improves all the node embedding techniques, with gains of up to 14%.
Symmetric Similarity
The resulting loss function can be interpreted as pushing the means closer while encouraging the two Gaussians to be more concentrated.
Asymmetric Similarity
A network motif is defined as “a pattern of inter-connections occurring in complex networks in numbers that are significantly higher than those in randomized networks”.
In the practical setting, given an input network, we first create randomized networks which have the same single-node characteristics (like the number of incoming and outgoing edges) as the input network.
The patterns that occur at a much higher frequency in the input graph (than in the randomized graphs) are reported as motifs.
More specifically, motifs are the patterns for which the probability of appearing in a randomized network an equal or greater number of times than in the real network is lower than a cutoff value (say 0.01).
Real-life networks exhibit properties like the “small world” property (the majority of nodes are within a distance of fewer than 7 hops from each other) and the “scale-free” property (the fraction of nodes having k edges decays as a power law).
Motifs are one such structural property, exhibited by networks in biochemistry, neurobiology, ecology and engineering. Further, the motifs shared by graphs from different domains differ, which hints at the usefulness of motifs as a fundamental structural property of a graph, one that relates to the process of its evolution.
The paper presents a generalized framework for graph clustering (clusters of network motifs) on the basis of higher-order connectivity patterns.
Given a motif M, the framework aims to find a cluster of a set of nodes S such that the nodes of S participate in many instances of M and avoid cutting instances of M (where only a subset of the nodes of an instance of M appears in S).
Mathematically, the aim is to minimise the motif conductance metric, given as cutM(S, S’) / min[volM(S), volM(S’)], where S’ is the complement of S, cutM(S, S’) is the number of instances of M which have at least one node in both S and S’, and volM(S) is the number of nodes in instances of M that belong only to S.
Solving the above problem exactly is computationally infeasible, and an approximate solution is proposed using eigenvalues and matrices.
The approximate solution is easy to implement, efficient, and guaranteed to find clusters that are at most a quadratic factor away from the optimal.
Given the network and a motif M, form the motif adjacency matrix WM, where WM(i, j) is the number of instances of M that contain both i and j.
Compute the spectral ordering of the nodes from the normalized motif Laplacian matrix.
Compute the prefix set of the spectral ordering with the smallest motif conductance (a sketch follows below).
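A minimal NumPy sketch of these three steps for the triangle motif, assuming an undirected 0/1 adjacency matrix in which every node participates in at least one triangle (isolated nodes should be filtered first); the sweep computes conductance on the motif adjacency matrix WM as a proxy for counting cut motif instances:

```python
import numpy as np

def motif_spectral_cluster(A):
    """Triangle-motif clustering: motif adjacency, spectral ordering from
    the normalized motif Laplacian, then a sweep cut over prefixes."""
    # Step 1: W[i, j] = number of triangles containing both i and j.
    W = (A @ A) * A
    d = W.sum(axis=1)
    # Step 2: spectral ordering via the second eigenvector of the
    # normalized motif Laplacian I - D^{-1/2} W D^{-1/2}.
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(A)) - D_inv_sqrt @ W @ D_inv_sqrt
    _, eigvecs = np.linalg.eigh(L)
    order = np.argsort(D_inv_sqrt @ eigvecs[:, 1])
    # Step 3: sweep over prefixes, keeping the one with smallest conductance.
    total, cut, vol = d.sum(), 0.0, 0.0
    in_S = np.zeros(len(A), dtype=bool)
    best_phi, best_k = np.inf, 1
    for k, v in enumerate(order[:-1], start=1):
        # Moving v into S adds its edges to the outside and removes
        # its edges to the inside from the cut.
        cut += W[v, ~in_S].sum() - W[v, in_S].sum()
        in_S[v] = True
        vol += d[v]
        phi = cut / min(vol, total - vol)
        if phi < best_phi:
            best_phi, best_k = phi, k
    return order[:best_k]
```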
The method is applicable to directed, undirected and weighted graphs (it allows for negative edge weights as well).
In case the motif is not known beforehand, the framework can be used to compute significant motifs.
The proposed framework unifies two fundamental tools of network science (motif analysis and network partitioning), comes with worst-case guarantees for the approximations employed, and can be extended to identify higher-order modular organization of networks.
The problem is the following:
We have a domain DS for which we have a labelled dataset of question-answer pairs, and another domain DT for which we do not have any labelled data.
We use the data from domain DS to train SynNet and use it to generate synthetic question-answer pairs for domain DT.
Now we can train a machine comprehension model M on DS and fine-tune it using the synthetic data for DT.
Works in two stages:
For training, map the words to their GloVe embeddings and pass them through a Bi-LSTM. Next, pass them through two FC layers followed by a softmax layer.
Given an input paragraph and a candidate answer, the Question Synthesis network generates the question one word at a time.
Map each word in the paragraph to its GloVe embedding. After the word vector, append a ‘1’ if the word was part of the candidate answer, else append a ‘0’.
Feed this to a Bi-LSTM network (encoder-decoder), where the decoder conditions on the representation generated by the encoder as well as on the question tokens generated so far. Decoding is stopped when the “END” token is produced.
The paragraph may contain some named entities or rare words which do not appear in the softmax vocabulary. To account for such words, a copying mechanism is also incorporated.
At each time step, a Pointer Network (CP) and a Vocabulary Predictor (VP) are used to generate probability distributions for the next word, and a Latent Predictor Network is used to decide which of the two networks would be used for the prediction.
At inference time, greedy decoding is used, where the most likely predictor is chosen and then the most likely word from that predictor is chosen.
Data Regularization - There is a need to alternate between mini-batches from the source and target domains while fine-tuning the MC model.
At inference time, the fine-tuned MC model is used to get the distributions P(i=start) and P(i=end) (corresponding to the likelihood of choosing word i as the starting or ending word of the answer) for all the words, and DP is used to find the optimal answer span.
Checkpoint Averaging - Use the different checkpointed models to average the answer likelihoods before running DP.
Using the synthetically generated dataset helps to gain a 2% improvement in terms of F-score (from SQuAD -> NewsQA). Using checkpointed models further improves the performance to an overall 46.6% F-score, which closes the gap with respect to the performance of a model trained on NewsQA itself (~52.3% F-score).
The paper presents a semi-supervised learning framework for graphs where the node embeddings are used to jointly predict both the class labels and the neighbourhood context. Usually, graph embeddings are learnt in an unsupervised manner and cannot leverage the supervising signal coming from the labelled data.
The framework is called Planetoid (Predicting Labels And Neighbors with Embeddings Transductively Or Inductively from Data).
Given a graph G = (V, E), with xL and xU as the feature vectors of the labelled and unlabelled nodes and yL as the labels of the labelled nodes, the problem is to learn a mapping (classifier) f: x -> y.
There are two settings possible:
Transductive - Predictions are made only for those nodes which are already observed in the graph at training time.
Inductive - Predictions are made for nodes whether or not they have been observed in the graph at training time.
The general semi-supervised learning loss would be LS + λLU, where LS is the supervised learning loss and LU is the unsupervised learning loss.
The unsupervised loss is a variant of the skip-gram loss with negative edge sampling.
More specifically, first a random walk sequence S is sampled. Then either a positive edge is sampled from S (within a given context distance) or a negative edge is sampled.
The label information is injected by using the label as a context and minimising the distance between positive edges (edges where the nodes have the same label) while maximising the distance between negative edges (edges where the nodes have different labels).
Two separate fully connected networks are applied over the node features and the node embeddings.
These 2 representations are then concatenated and fed to a softmax classifier to predict the class label.
In the inductive setting, it is difficult to obtain the node embeddings at test time. One naive approach is to retrain the network to obtain the embeddings of the previously unobserved nodes, but that is inefficient.
Instead, the embedding of a node x is parameterized as a function of its input feature vector and is learnt by applying a fully connected neural network to the node feature vector.
This provides a simple way to extend the original approach to the inductive setting.
The proposed approach is evaluated in 3 settings (text classification, distantly supervised entity extraction and entity classification), and it consistently outperforms approaches that use just node features or node embeddings.
The key takeaway is that joint training in the semi-supervised setting has several benefits over the unsupervised setting, and that using the graph context (in terms of node embeddings) is much more effective than using a graph-Laplacian-based regularization term.
Unsupervised text embeddings can be generalized across different tasks, but they have weaker predictive power (as compared to end-to-end trained deep learning methods) for any particular task. On the other hand, the deep learning techniques are expensive and need a large amount of supervised data and a large number of parameters to tune.
The paper introduces Predictive Text Embedding (PTE) - a semi-supervised approach which learns an effective low-dimensional representation using a large amount of unsupervised data and a small amount of supervised data.
The work can be extended to general information networks as well, since classic techniques like MDS, Isomap, Laplacian Eigenmaps, etc. do not scale well to large graphs.
Further, this model can be applied to heterogeneous networks, unlike the previous works LINE and DeepWalk, which work on homogeneous networks only.
The paper proposes 3 different kinds of networks:
All 3 graphs are integrated into one heterogeneous text network.
First, the authors extend their previous work, LINE, to heterogeneous bipartite text networks, as explained:
Given a bipartite graph G = (VA ∪ VB, E), where VA and VB are disjoint sets of vertices, the conditional probability of va (in set VA) being generated by vb (in set VB) is given as the softmax score between the embeddings of va and vb, normalised by the sum of exponentials of the dot products between vb and all nodes in VA.
The second-order proximity can be determined by the conditional distributions p(.|vj).
The objective to be minimised is the KL divergence between the conditional distribution p(.|vj) and the empirical distribution p^(.|vj) (given as wij / degj).
Now, the 3 individual networks can all be interpreted as bipartite networks, so the node representations of all 3 individual networks are obtained as described above.
For the word-label network, since the training data is sparse, one could either train the unlabelled networks first and then the labelled network, or they could all be trained jointly.
For the case of joint training, the edges are sampled from the 3 networks alternately.
For the fine-tuning case, the edges are first sampled from the unlabelled networks and then from the labelled network.
Once the word embeddings are obtained, the text embeddings may be obtained by simply averaging the word embeddings.
Baseline Models
For long documents, PTE (joint) outperforms CNN and the other PTE variants, and is around 10 times faster than the CNN model.
For short documents, PTE (joint) does not always outperform the CNN model, probably because word sense ambiguity is more relevant in short documents.
In machine learning, it is common to train a single large model (with a large number of parameters) or an ensemble of multiple smaller models using the same dataset.
While such large models help to improve the performance of the system, they also make it difficult and computationally expensive to deploy the system.
The paper proposes to transfer the knowledge from such “cumbersome” models into a single, “simpler” model which is more suitable for deployment. This transfer of knowledge is referred to as “distillation”.
Train the cumbersome model using the given training data in the usual way.
Train the simpler, distilled model using the class probabilities (from the cumbersome model) as soft targets. Thus, the simpler model is trained to generalise the same way as the cumbersome model.
If the soft targets have high entropy, they provide much more information than the hard targets, and the gradient (between training examples) varies less.
One approach is to minimise the L2 difference between the logits produced by the cumbersome model and the simpler model. This approach was pursued by Buciluǎ et al.
The paper proposes a more general solution, which they name “distillation”. The temperature of the final softmax is raised until the cumbersome model produces a suitably soft set of targets (from the final softmax layer). These soft targets are then used to train the simpler model.
It also shows that the proposed approach is, in fact, a more general case of the first approach.
In the simplest setting, the soft targets are obtained by running the cumbersome model with a high value of temperature, and the same temperature is used when training the simpler model. The temperature is set to 1 when making predictions using the simpler model.
It helps to add an auxiliary objective function which corresponds to the cross-entropy loss with the correct labels. This second objective function should be given a much lower weight though. Further, the magnitude of the gradients from the soft targets needs to be scaled by multiplying with the square of the temperature (a sketch follows below).
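A minimal sketch of this combined objective (the T and alpha values are illustrative, not the paper's):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.1):
    """Cross-entropy against the teacher's temperature-softened targets,
    plus a smaller-weighted cross-entropy against the hard labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    # Multiply by T^2 so the soft-target gradients keep the same scale
    # as the hard-target gradients when T changes.
    soft_loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T ** 2)
    hard_loss = F.cross_entropy(student_logits, labels)
    return (1 - alpha) * soft_loss + alpha * hard_loss
```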
The paper reports favourable results on the distillation task in the following domains:
Image Classification (on the MNIST dataset)
Automatic Speech Recognition (ASR)
An extra experiment is performed where the baseline model is trained using both hard targets and soft targets alternately. Further, only 3% of the total dataset is used.
The model using hard targets overfits and has poor test accuracy, while the model using soft targets does not overfit and gets much better test accuracy. This shows the regularizing effect of soft targets.
Training ensembles of specialists for very large datasets (JFT dataset - an internal dataset at Google)
The experiment shows that while training a single large model would take a lot of time, the performance of the model can be improved by learning a small number of specialised networks (which are faster to train).
Though it is yet to be shown that the knowledge of such specialist models can be distilled back into a single model.
When neural networks are trained on images, they tend to learn the same kind of features in the first layer (corresponding to Gabor filters or colour blobs). The first-layer features are “general”, irrespective of the task/optimizer, etc.
The final-layer features tend to be “specific” in the sense that they strongly depend on the task.
The paper studies the transition of the generality property across the layers of the network. This could be useful in the domain of transfer learning, where features are reused across tasks.
The degree of generality of a set of features, learned on task A, is defined as the extent to which these features can be used for another task B.
Randomly split the 1000 ImageNet classes into 2 groups (corresponding to tasks A and B). Each group has 500 classes and half the total number of examples.
Two 8-layer convolutional networks are trained on the two datasets and labelled baseA and baseB respectively.
Now choose a layer number n from {1, 2, …, 7}.
For each layer n, train the following two networks:
If AnB performs well, the nth-layer features are “general”.
In another setting, the transferred layers are also fine-tuned (BnB+ and AnB+).
The ImageNet dataset contains a hierarchy of classes, which allows for creating the datasets A and B with high and low similarity.
For n = {1, 2}, the performance of the BnB model is the same as that of the baseB model. For n = {3, 4, 5, 6}, the performance of the BnB model is worse.
This indicates the presence of “fragile co-adaptation” between features on successive layers, where features interact with each other in a complex way and cannot be easily separated across layers. This is more prominent across the middle layers and less across the first and the last layers.
For the AnB model, the performance matches baseB for n = {1, 2}. Beyond that, the performance begins to drop.
Transfer learning of features followed by fine-tuning gives better results than training the network from scratch.
Instead of using transferred weights in BnB and AnB, the first n layers were initialized randomly.
The performance falls for layers 1 and 2, and further drops to a near-random level for layers 3 and beyond.
Another interesting insight is that even for dissimilar tasks, transferring features is better than using random features.
Problem Statement: Given an image, answer a given question about the image.
Assumptions:
The paper proposes the ECM (Emotional Chatting Machine), which can generate both semantically and emotionally appropriate responses in a dialogue setting.
More specifically, given an input utterance or dialogue and the desired emotional category of the response, ECM is to generate an appropriate response that conforms to the given emotional category.
Much of the recent, deep-learning-based work on conversational agents has focused on the use of the encoder-decoder framework, where the input utterance (a given sequence of words) is mapped to a response utterance (a target sequence of words). This is the so-called seq2seq family of models.
The ECM model sits within this framework and introduces 3 new components:
Loss function
Metric
ECM achieves a perplexity of 65.9 and an emotional accuracy of 0.773.
Based on human evaluations, ECM statistically outperforms the seq2seq baselines on both naturalness (likeliness of the response being generated by a human) and emotion accuracy.
Notes
The paper describes a general-purpose neural embedding model where different types of entities (described in terms of discrete features) are embedded in a common vector space.
A similarity function is learnt to compare these entities in a meaningful way and score their similarity. The definition of the similarity function could depend on the downstream task where the embeddings are used.
Each entity is described as a set of discrete features. For example, for the recommendation use case, a user may be described as a bag of the movies they have liked. For the search use case, a document may be described as a bag of the words it is made up of.
Given a dataset and a task at hand, generate a set of positive samples E+ = (a, b) such that a is the input to the task (from the dataset) and b is the expected label (answer/entity) for the given task.
Similarly, generate another set of negative samples E- = (a, bi-) such that bi- is one of the incorrect labels (answers/entities) for the given task. The incorrect entity can be sampled randomly from the set of candidate entities. Multiple incorrect samples can be generated for each positive example; these incorrect samples are indexed by i.
For example, in the case of a supervised learning problem like document classification, a would be one of the documents (probably described in terms of words), b the correct label and bi- one of the randomly sampled labels from the set of all the labels (excluding the correct label).
In the case of collaborative filtering, a would be the user (either described as a discrete entity like a user ID or in terms of the items purchased so far), b the next item the user purchases and bi- one of the randomly sampled items from the set of all the items.
A similarity function is chosen to compare the representations of entities of types a and b. The paper considered cosine similarity and the inner product, and observed that cosine similarity works better in the case of a large number of entities.
A loss function compares the similarity between the positive pair (a, b) and the negative pairs (a, bi-). The paper considered the margin ranking loss and the negative log loss of softmax, and reported that the margin ranking loss works better (a sketch follows below).
The norm of the embeddings is capped at 1.
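A minimal sketch of the margin ranking loss with cosine similarity over one positive pair and k sampled negatives (the margin value is illustrative):

```python
import torch
import torch.nn.functional as F

def starspace_loss(a_emb, b_emb, b_neg_embs, margin=0.2):
    """Hinge loss that pushes sim(a, b) above sim(a, b_i^-) by a margin.
    a_emb, b_emb: (d,); b_neg_embs: (k, d) sampled negative entities."""
    pos = F.cosine_similarity(a_emb, b_emb, dim=-1)
    neg = F.cosine_similarity(a_emb.unsqueeze(0), b_neg_embs, dim=-1)
    return torch.clamp(margin - pos + neg, min=0).sum()
```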
The same model architecture is applied to a variety of tasks, including multi-class classification, multi-label classification, collaborative filtering, content-based recommendation, link prediction, information retrieval, word embeddings and sentence embeddings.
The model provides a strong baseline on all the tasks and performs on par with much more complicated and task-specific networks.
Sequence-to-sequence models have made abstractive summarization viable, but they still suffer from issues like out-of-vocabulary words and repetitive sentences.
The paper proposes to overcome these limitations by using a hybrid Pointer-Generator network (to copy words from the source text) and a coverage vector that keeps track of content that has already been summarized, so as to discourage repetition.
It is a hybrid model between the sequence-to-sequence network and the Pointer Network, such that when generating a word, the model decides whether the word would be generated using the softmax vocabulary (sequence-to-sequence) or using the source vocabulary (Pointer Network).
Since the model can choose a word from the source vocabulary, the issue of out-of-vocabulary words is handled.
The model maintains a coverage vector, which is the sum of the attention distributions over all previous decoder timesteps.
This coverage vector is fed as an input to the attention mechanism.
A coverage loss is added to prevent the model from repeatedly attending to the same word.
The idea is to capture how much coverage different words have already received from the attention mechanism (a sketch follows below).
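A minimal sketch of the copy/generate blend and the coverage penalty, assuming the generator's vocabulary distribution has already been padded with zeros up to the extended vocabulary (function and argument names are illustrative):

```python
import torch

def final_distribution(p_gen, vocab_dist, attention, source_ids):
    """Blend the decoder's vocabulary softmax with the copy distribution.
    vocab_dist: (extended_vocab,) padded generator softmax;
    attention: (src_len,); source_ids: (src_len,) positions of the
    source words in the extended vocabulary."""
    final = p_gen * vocab_dist
    # Scatter the copy probability mass onto the source-word slots.
    return final.scatter_add(0, source_ids, (1 - p_gen) * attention)

def coverage_loss(attention, coverage):
    """Penalize re-attending: sum of the elementwise minimum between the
    current attention and the coverage (sum of past attentions)."""
    return torch.minimum(attention, coverage).sum()
```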
The model, when evaluated on the CNN/Daily Mail summarization task, outperforms the state of the art by at least 2 ROUGE points, though it still does not outperform the lead-3 baseline.
The lead-3 baseline uses the first 3 sentences as the summary of the article, which should be a strong baseline given that the dataset consists of news articles.
The model is initially trained without coverage and then fine-tuned with the coverage loss.
During training, the model first learns how to copy words and then how to generate words (pgen starts from 0.3 and converges to 0.53).
During testing, the model strongly prefers copying over generating (pgen = 0.17).
Further, whenever the model is at the beginning of a sentence or at the join between stitched-together fragments, it prefers to generate a word instead of copying one from the source text.
The overall model is very simple, neat and interpretable, and also performs well in practice.
The paper presents the Neural Relational Inference (NRI) model, which can infer the underlying interactions in a dynamical system in an unsupervised manner, using just the observational data in terms of the trajectories.
For instance, consider a simulated system where the particles are connected to each other by springs. The observational data does not explicitly specify which particles are connected to each other and only contains information like the position and velocity of each particle at different timesteps.
The task is to explicitly infer the interaction structure (in this example, which pairs of particles are connected to each other) while learning the dynamical model of the system itself.
The model consists of an encoder that encodes the given trajectories into an interaction graph and a decoder that decodes the dynamical model given the interaction graph.
The model starts by assuming that a fully connected interaction graph exists between the objects in the system.
For this latent graph z, zij denotes the (discrete) edge type between objects vi and vj, with the assumption that there are K edge types.
The object vi has a feature vector xit associated with it at time t. This feature vector captures information like location and velocity.
A Graph Neural Network (GNN) acts on the fully connected latent graph z, performs message passing from node to node via the edges, and predicts the discrete label for each edge.
The GNN architecture may itself use MLPs or ConvNets and returns a factorised distribution over the edge types, qφ(z|x).
The decoder is another GNN (with separate parameters for each edge type) that predicts the future dynamics of the system and returns pθ(x|z).
The model is trained as a VAE by maximising the ELBO: E_qφ(z|x)[log pθ(x|z)] − KL[qφ(z|x) || pθ(z)]
pθ(z) is the prior, which is assumed to be a uniform distribution over the edge types.
Instead of predicting the dynamics of the system for just the next timestep, the paper chooses to predict multiple steps (10) into the future. This ensures that the interactions can have a significant effect on the dynamics of the system.
Given the dynamical system, run the encoder to obtain qφ(z|x).
Sample zij from qφ(z|x).
Run the decoder to predict the future dynamics for the next T timesteps.
Optimise the ELBO loss.
Note that since the latent variables (edge labels) are discrete in this case, the sampling is done from a continuous approximation of the discrete distribution, and the reparameterization trick is applied over this continuous approximation to get (biased) gradients (a sketch follows below).
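A minimal sketch of this sampling step, using the Gumbel-softmax (concrete) relaxation available in PyTorch:

```python
import torch
import torch.nn.functional as F

def sample_edge_types(edge_logits, tau=0.5, hard=True):
    """Draws differentiable, approximately one-hot samples over the K edge
    types for every pair (i, j). edge_logits: (num_edges, K) unnormalized
    log-probabilities from the encoder; tau is the relaxation temperature."""
    return F.gumbel_softmax(edge_logits, tau=tau, hard=hard, dim=-1)
```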
Experiments are performed using simulated systems, like particles connected by springs, phase-coupled oscillators and charged particles, and using real-world data, like the CMU Motion Capture database and NBA tracking data.
The NRI system effectively predicts the dynamics of the systems and is able to reconstruct the ground-truth interaction graph (for the simulated systems).
The paper presents NeuroSAT, a message-passing neural network that is trained to predict if a given SAT problem can be solved. As a side effect of training, the model also learns how to solve the SAT problem itself, without any extra supervision.
Given an expression in propositional logic, the task is to predict if there exists an assignment of the variables that makes the expression true.
The expression itself can be written as a conjunction of disjunctions (“and” over “or”), where each conjunct is called a clause and each (possibly negated) variable within a clause is called a literal.
Invariants
The variables, the clauses, or the literals (within the clauses) can be permuted.
Every occurrence of a variable can be negated.
Given the SAT problem, create an undirected graph of the literals, their negations and the clauses they belong to.
Put an edge between every literal and the clauses to which it belongs, and another kind of edge between every literal and its negation.
Perform message passing between the nodes to obtain vector representations corresponding to each node. Specifically, first each clause receives a message from its neighbours (literals) and updates its embedding. Then every literal receives a message from its neighbours (both literals and clauses) and updates its embedding.
After T iterations, the nodes vote to decide the prediction of the model as a whole.
The model is trained end-to-end using the cross-entropy loss between the logit and the true label.
Permutation invariance is ensured because the network operates only on the graph topology (with shared update functions across nodes and edges), and negation invariance is ensured by treating all literals the same.
The most interesting aspect of this work is that even though the model was trained only to predict if the SAT problem can be satisfied, it is actually possible to extract the satisfying assignment from the classifier.
In the early iterations, all the nodes vote “unsolvable” with low confidence. Then a few nodes start voting “solvable”, and then a phase transition happens where most of the nodes start voting “solvable” with high confidence.
The model never becomes highly confident that a problem is “unsolvable” and almost never guesses “solvable” on an “unsolvable” problem. So, in some sense, the model is looking for the combination of literals that actually solves the problem.
The authors found that the 2-dimensional PCA projections of the literal embeddings are initially mixed up but become more and more linearly separable as the phase transition happens.
Based on this insight, the authors propose to obtain cluster centres C1 and C2, partition the variables according to the cluster centres, and then try the assignments from both partitions.
This alone provides a satisfying solution in over 70% of the cases, even though there is no explicit supervising signal about how to solve the problem.
The other strengths of the paper include:
Generalizing to longer and more difficult SAT problems (than those seen during training).
Generalizing to other kinds of search problems, like graph colouring, clique detection, etc. (over small random graphs).
The paper also reports that by adding a supervising signal about which clauses in the given expression are unsatisfiable, it is possible to decode the literals which prove the “unsatisfiability” of an expression at test time. Though not a lot of details have been provided about this part, they would probably be covered in the next iteration of the paper.
Catastrophic forgetting refers to the phenomenon where, when a learning system is trained on two tasks in succession, it may forget how to perform the first task.
The paper investigates this behaviour for different activation functions, in the presence and absence of dropout.
For each experiment, two tasks are defined - an “old” task and a “new” task.
The network is first trained on the “old” task until the validation set error has not improved for the last 100 epochs.
The “best” performing model is then trained on the “new” task until the combined error on the “old” and the “new” validation datasets has not improved in the last 100 epochs.
All the tasks used the same model architecture - 2 hidden layers followed by a softmax layer.
Models were trained using SGD, with or without dropout.
For each combination of model, activation and training mechanism, a random hyperparameter search was performed over a set of 25 hyperparameter configurations.
In terms of the relationship between the “old” and the “new” tasks, three kinds of settings are considered:
The tasks are very similar, but the input is processed in a different format. For this setting, the MNIST dataset was used, with a different permutation of pixels for the “old” and the “new” task.
The tasks are similar but not exactly the same. For this setting, the task was to predict the sentiment of reviews across 2 different product categories.
In the last setting, 2 dissimilar tasks were used. One task was to predict the sentiment of reviews and the other was to perform classification over the MNIST dataset (reduced to 2 classes).
Using dropout improved the overall validation performance for all the models on all the tasks.
Using dropout also increased the size of the optimal model across all the activations, indicating that maybe the increased size of the model could explain the increased resistance to forgetting. It would have been interesting to check if dropout always selected the largest model possible given the set of hyperparameters.
On the dissimilar tasks, dropout improved the performance while reducing the model size, so it might have other properties as well that help to prevent forgetting.
As compared to the choice of training technique, the activation function has a less consistent effect on the resistance to forgetting. The paper recommends performing cross-validation for the choice of the activation function. If that is not feasible, the maxout activation function with dropout could be used.
Information Extraction - Given a query to be answered and an external search engine, information extraction entails the task of issuing search queries, extracting information from new sources and reconciling the extracted values until we are sufficiently confident about them.
The paper proposes the use of Reinforcement Learning (RL) to solve this task.
The information extraction task is modelled as a Markov Decision Process (MDP) <S, A, T, R>.
A Deep Q-Network (DQN) is used.
Conventional wisdom says that, when training neural networks, the learning rate should monotonically decrease. This insight forms the basis of the different types of adaptive learning rate methods.
Counter to this expected behaviour, the paper demonstrates that using a cyclical learning rate (CLR), varying between a minimum and a maximum value, helps to train the neural network faster without requiring fine-tuning of the learning rate.
The paper also provides a simple approach to estimate the lower and upper bounds for CLR.
Difficulty in minimizing the loss arises from saddle points and not from local minima. [Ref]
Increasing the learning rate allows for rapid traversal of saddle points.
Alternatively, the optimal learning rate is expected to lie between the bounds of the CLR, so the learning rate would always be close to the optimal learning rate.
Cycle Length = Number of iterations until the learning rate returns to the initial value = 2 * step_size.
step_size should be set to 2-10 times the number of iterations in an epoch.
Estimating the CLR boundary values:
Run the model for several epochs while increasing the learning rate between the allowed low and high values.
Plot accuracy vs learning rate and note the learning rate value at which the accuracy starts to fall.
This gives good candidate values for the upper and lower bounds. Alternatively, the lower bound could be set to 1/3 or 1/4 of the upper bound. But it is difficult to judge if the model has run for a sufficient number of epochs in the first place (a sketch of the triangular schedule follows below).
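A minimal sketch of the triangular CLR policy (this formulation follows the paper's reference implementation; variable names are illustrative):

```python
import numpy as np

def triangular_clr(iteration, step_size, base_lr, max_lr):
    """Linearly ramps the learning rate from base_lr to max_lr over
    step_size iterations and back down, repeating every 2 * step_size
    iterations."""
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)
```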
Empirical evidence indicates that, at training time, neural networks need to be of significantly larger size than necessary.
The paper proposes a hypothesis called the lottery ticket hypothesis to explain this behaviour.
The idea is the following - successful training of a neural network depends on a lucky random initialization of a subcomponent of the network. Such components are referred to as lottery tickets.
Larger networks are more likely to have these lottery tickets and hence are easier to train.
Various aspects of the hypothesis are explored empirically.
Two tasks are considered - MNIST and XOR.
For each task, the paper considers networks of different sizes and empirically shows that larger networks are more likely to converge (or have better performance) for a fixed number of epochs, as compared to the smaller networks.
Given a large, trained network, some weights (or units) of the network are pruned and the resulting network is reset to its initial random weights.
The resulting network is the lottery ticket, in the sense that when the pruned network is trained, it is more likely to converge than an otherwise randomly initialised network of the same size. Further, it is more likely to match the original, larger network in terms of performance.
The paper explores different aspects of this experiment:
The size of the pruned network affects the speed of convergence when training the lottery ticket.
If only the architecture or only the initial weights of the lottery ticket are used, the resulting network tends to converge more slowly and achieves a lower level of performance.
The paper includes some more interesting experiments. For instance, the distribution of the initializations among the weights that survived the pruning suggests that weights which are small before training tend to remain small after training.
One interesting experiment would be to show the performance of the pruned network before resetting its weights and retraining. This performance should be compared with the performance of the initial large network and the performance of the lottery ticket after training.
Overall, the experiments are not sufficient to conclude anything about the correctness of the hypothesis, but the proposition itself is very interesting and could enhance our understanding of how neural networks work.
+Convolutional Neural Networks are extremely good feature extractors in the sense that features extracted for one task (say image classification) can be easily transferred to another task (say image segmentation).
+Existing unsupervised approaches do not aim to learn discriminative features and supervised approaches for discriminative features do not scale well.
+The paper presents an approach to learn features in an unsupervised setting by using a set of target representations called as Noise As Target (NAT) which acts as a kind of proxy supervising signal.
+In the setting of the problem where we are learning both the features and the target representation, a trivial solution would be the one where all the input images map to the same target and are assigned the same representation. No discriminative features are learned in this case.
+To avoid such situations, a set of k predefined target representations are chosen and each image is mapped to one of these k representations (based on the features).
+There is an assumption that k > n so that each image is assigned a different target.
+One simple choice of target representation is the standard one-hot vector, which implies that all the classes (and, by extension, the associated images) are orthogonal and equidistant from each other. But this is not a reasonable approximation as not all image pairs are equally similar or dissimilar.
+Instead, the target vectors are uniformly sampled from a d-dimensional unit sphere, where d is the dimensionality of the feature representation. That is, the idea is to map the features to the manifold of the d-dimensional L2 sphere, using the k predefined representations as a discrete approximation of the manifold.
+Since each data point (image) is mapped to a new point on the manifold, the algorithm is suited for online training as well.
+For training, the number of targets k is set equal to the number of images n, and an assignment matrix P is learned to ensure that the mapping between images and targets is 1-to-1.
+The resulting optimisation problem can be solved using the Hungarian algorithm, but at a high cost of O(n^3). An optimisation is to take a batch of b images and update only the square sub-matrix PB of dimension b x b (formed by the images in the batch and their currently assigned targets), as sketched below. This reduces the overall complexity to O(nb^2).
+Other optimisation techniques that are common in supervised learning, like batch norm, are used in this setting as well.
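+A sketch of the batched re-assignment step, assuming scipy's Hungarian solver and L2-normalized features (names are illustrative):
+
+    import numpy as np
+    from scipy.optimize import linear_sum_assignment
+
+    def sample_targets(n, d, rng=np.random.default_rng(0)):
+        # Targets sampled uniformly from the d-dimensional unit sphere.
+        t = rng.normal(size=(n, d))
+        return t / np.linalg.norm(t, axis=1, keepdims=True)
+
+    def reassign_batch(feats, targets):
+        # Solve the b x b matching between a batch of features and its
+        # current targets: O(b^3) per batch instead of O(n^3) overall.
+        cost = -feats @ targets.T  # maximise dot products = minimise cost
+        _, cols = linear_sum_assignment(cost)
+        return targets[cols]       # re-permuted targets for this batch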
+Used AlexNet with NATs to train the unsupervised model.
+An MLP is trained on these features to learn the classifier.
+Standard preprocessing techniques like random cropping/flipping are used.
+Dataset
+ +ImageNet for training the AlexNet architecture with the proposed approach.
+Pascal VOC 2007 for transfer learning experiments.
+Baselines
+ +Unsupervised approaches like autoencoder, GAN, BiGAN
+Self-supervised
+SOTA models using handcrafted features (SIFT with Fisher Vectors).
+Using squared loss instead of softmax does not deteriorate the performance too much.
+The authors compare the effect of using discrete vs continuous target representations for transfer learning. For the discrete representation, elements of the canonical basis of a k-dimensional space (k=1000, 10000, 100000) are used. Experiments demonstrate that d-dimensional continuous vectors perform much better than the discrete vectors.
+While training the unsupervised network, its features were extracted after every 20 iterations to evaluate performance on the transfer learning task. The test accuracy increases up to around 100 iterations and then saturates.
+Comparing the visualization of the first convolutional layer filters (for AlexNet with and without supervision) shows that while unsupervised filters are less sharp, they maintain the edge and orientation information.
+The proposed unsupervised method outperforms all the unsupervised baselines and is competitive with respect to the supervised baseline. But it is still far behind the model using handcrafted features.
+For transfer learning on Pascal VOC, the proposed approach beats the unsupervised baselines and works at par with the supervised approach.
+The paper proposed a simple unsupervised framework for learning discriminative features without having to rely on proxy tasks like image generation and without having to make an assumption about the input domain.
+The key aspect of the proposed approach is that each image is assigned to a unique point in the d-dimensional manifold which means 2 images could be very close to each other on the manifold while being quite distinct in reality. It is interesting to see that such a simple strategy is able to give such good results.
+The paper presents a general message passing architecture called Message Passing Neural Networks (MPNNs) that unifies various existing models for performing supervised learning on molecules.
+Variants of the MPNN model achieve very good performance on the task of predicting the property of the molecules.
+The input to the model is an undirected graph G where node features are represented as xv (corresponding to node v) and edge features as evw (corresponding to the edge between nodes v and w).
+The idea is to learn a representation (or feature vector) for all the nodes (and possibly edges) in the graph and use that for the downstream supervised learning task.
+The model can be easily extended to the setting of directed graphs.
+The model works in 2 phases:
+All nodes send a message to their neighbouring nodes. The message is a function of the feature vectors corresponding to the sender node (or vertex), the receiver node and the edge connecting the two nodes. The feature vectors can be combined to form the message using the message function which can be implemented as a neural network.
+Once a node has received messages from all its neighbours, it updates its feature vector by aggregating all the messages. The function used to aggregate and update the feature vector is called the update function and can be implemented as a neural network.
+After updating the feature vectors, the graph could initiate another round of message passing. After a sufficient number of message passing rounds, the Readout phase is invoked.
+The feature vectors corresponding to different nodes in the graph are aggregated into a single feature vector (corresponding to the feature vector of the graph) using the readout function.
+The readout function can also be implemented using a neural network, with the condition that it is invariant to the permutation of the nodes within the graph (to ensure that the MPNN is invariant to graph isomorphism).
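+A minimal sketch of one message passing round and a readout, with message_fn and update_fn standing in for the learned networks:
+
+    import numpy as np
+
+    def mpnn_round(h, edges, message_fn, update_fn):
+        # h: {node: feature vector}; edges: list of (v, w, e_vw) tuples.
+        inbox = {v: [] for v in h}
+        for v, w, e in edges:  # undirected graph: messages flow both ways
+            inbox[w].append(message_fn(h[v], h[w], e))
+            inbox[v].append(message_fn(h[w], h[v], e))
+        return {v: update_fn(h[v], np.sum(inbox[v], axis=0)) if inbox[v] else h[v]
+                for v in h}
+
+    def readout(h):
+        # Permutation-invariant readout: here a simple sum over node states.
+        return np.sum(list(h.values()), axis=0)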
+Broadly speaking, the task is to predict the properties of given molecules (regression problem).
+The QM9 dataset consists of 130K molecules whose properties have been measured using Quantum Mechanical Simulations (DFT).
+Properties to be predicted include atomization energy, enthalpy, highest fundamental vibrational frequency etc.
+There are two benchmarks for error:
+ +DFT Error - Estimated average error of DFT approximation
+Chemical Accuracy - As established by the chemistry community
+Following variants of message function are explored:
+ +Matrix multiplication between Aevw and hv, where A is the adjacency matrix and hv is the feature vector corresponding to node v.
+Edge Network, which is the same as the matrix-multiplication case, with the difference that A is a learned matrix for each edge type.
+Pair Network where the feature vector corresponding to the source node, target node and edge is fed to a neural network.
+Since all messages are shared via edges, it could take a long time for a message to travel between the two ends of the graph. To speed up this process, virtual elements are introduced.
+In the first setting, “virtual edges” are inserted between nodes.
+In the second setting, a “master” node connects to all the other nodes.
+In a graph with n nodes and d-dimensional feature vectors, a single step of message passing has a worst-case time complexity of O(n^2 d^2).
+This complexity can be reduced by breaking the d-dimensional embedding into k different groups of d/k embeddings which can be updated in parallel. The complexity of the modified approach is O(n^2 d^2 / k).
+Best performing MPNN model uses edge network as the message function and set2set as the readout function.
+Using groups of embeddings helps improve generalization, though this effect could also be due to the ensemble-like nature of the modified architecture.
+The model performs worse without the virtual elements.
+Long range interaction between vertices is necessary.
+Scaling to larger molecule sizes is challenging because the model creates a fully connected graph by incorporating virtual elements.
+Additionally, we need to ensure that any modification to the architecture, made to enhance performance on counting questions, does not degrade performance on other classes of questions.
+The paper proposes to overcome these challenges by using the attention maps (and not the aggregated feature vectors) as input to a separate count module.
+The basic idea is quite intuitive: when we perform weighted averaging based on different attention maps, we end up averaging the features corresponding to the different instances of an object. This makes the feature vectors indistinguishable from the scenario where we had just one instance of the object in the image.
+ +Even multiple glimpses (multiple attention steps) can not resolve this problem as the weights given to one feature vector would not depend on the other feature vectors (that are attended to). Hard attention could be more useful than soft-attention but there is not much empirical evidence in support of this hypothesis.
+ +The proposed count module is a separate pipeline that can be integrated with most of the existing attention based VQA models without affecting the performance on non-count based questions.
+ +The inputs to the count module are the attention maps and the object proposals (coming from some pre-trained model like the RCNN model), and the output is a count feature vector which is used to answer the count-based question.
+ +The top level idea is the following - given the object proposals and the attention maps, create a graph where nodes are objects (object proposals) and edges capture how similar two object proposals are (how much do they overlap). The graph is transformed (by removing and scaling edges) so that the count of the object can be obtained easily.
+ +To explain their methodology, the paper simplifies the setting by making two assumptions:
+These simplifying assumptions are made only for the sake of exposition and do not limit the capabilities of the count module.
+ +Given the assumptions, the task of the count module is to handle the exact duplicates to prevent double-counting of objects.
+ +As the first step, the attention weights (a) are used to generate an attention matrix (A) by performing an outer product between a and aT. This corresponds to the step of creating a graph from the input.
+ +A corresponds to the adjacency matrix of that graph. The attention weight for the ith proposal corresponds to the ith node in the graph and the edge between the nodes i and j has the weight ai*aj.
+ +Also note that the graph is a weighted directed graph, and the subgraph of vertices satisfying the condition ai = 1 is a complete directed graph with self-loops. Given such a graph, the number of vertices V = sqrt(E), where E can be computed by summing over the adjacency matrix. This implies that if the proposals are distinct, the count can be obtained trivially by summing over the adjacency matrix.
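+A toy illustration of this counting identity, under the two simplifying assumptions (attention weights in {0, 1}, duplicates overlapping exactly):
+
+    import numpy as np
+
+    def count_from_attention(a):
+        A = np.outer(a, a)       # adjacency matrix of the attention graph
+        return np.sqrt(A.sum())  # V = sqrt(E) for a complete digraph with self-loops
+
+    # count_from_attention(np.array([1., 1., 1., 0.])) -> 3.0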
+ +The objective is now to eliminate the edges such that the underlying objects are the vertices of a complete subgraph. This requires removing two types of duplicate edges - intra-object edges and inter-object edges.
+ +Intra-object edges can be removed by computing a distance matrix, D, defined as 1 - IoU, where the IoU matrix corresponds to the Intersection-over-Union matrix. A modified adjacency matrix A’ is obtained by performing the element-wise product between f1(A) and f2(D), where f1 and f2 are piece-wise linear functions that are learnt via backpropagation.
+ +The inter-object edges are removed in the following manner:
+ +The resulting per-proposal scaling vector s can be converted into a matrix (by taking its outer product with itself) so as to scale both the incoming and the outgoing edges. The self edges (which were removed while computing A’) are added back (after scaling with s) to obtain a new transformed matrix C.
+ +The transformed matrix C is a complete graph with self-loops where the nodes correspond to all the relevant object instances and not to object proposals. The actual count can be obtained from C by performing a sum over all its values, as described earlier.
+ +The original count problem is a regression problem, but it is transformed into a classification problem to avoid scale issues. The network produces a k-hot n-dimensional vector o, where n is the number of object proposals that were fed into the module (and hence the upper limit on how large a number the module can count).
+ +In the ideal setting, k should be one, as the network would produce an integer value, but in practice the network produces a real number, so k can be up to 2. If c is an exact integer, the output is a one-hot vector with the value at the index corresponding to c set to 1. If c is a real number, the output is a linear interpolation between two one-hot vectors (corresponding to the two integers between which c lies).
+ +The count module supports computing the confidence of a prediction by defining two variables pa and pD, which compute the average distance of f6(a) and f7(D) from 0.5. The final output o’ is defined as f8(pa + pD) · o.
+ +All the different f functions are piece wise linear functions and are learnt via backpropagation.
+ +The authors created a new category of count-based questions by filtering the number-type questions to remove questions like “What is the time right now?”. These questions do have a numerical answer but do not fall under the purview of count-based questions and hence are not targeted by the count module.
+ +The authors augmented a state of the art VQA model with their count module and show substantial gains over the count-type questions for the VQA-v2 dataset. This augmentation does not drastically impact the performance on non-count questions.
+ +The overall idea is quite crisp and intuitive, and the paper is easy to follow. It would be even better if there were some more ablation studies. For example, why are the piece-wise linear functions assumed to have 16 linear components? Would a smaller or larger number be better?
diff --git a/_site/site/2018/05/21/Net2Net-Accelerating-Learning-via-Knowledge-Transfer.html b/_site/site/2018/05/21/Net2Net-Accelerating-Learning-via-Knowledge-Transfer.html new file mode 100644 index 00000000..3c404a37 --- /dev/null +++ b/_site/site/2018/05/21/Net2Net-Accelerating-Learning-via-Knowledge-Transfer.html @@ -0,0 +1,69 @@ +The paper presents a simple yet effective approach for transferring knowledge from a trained neural network (referred to as the teacher network) to a large, untrained neural network (referred to as the student network).
+The key idea is to use a function-preserving transformation that guarantees that for any given input, the output from the teacher network and the newly created student network would be the same.
+The approach works as follows - Let us say that the teacher network is represented by the transformation y = f(x, θ), where θ refers to the parameters of the network. The task is to choose a new set of parameters θ’ for the student network g(x, θ’) such that for all x, f(x, θ) = g(x, θ’).
+To start, we can assume that f and g are composed of standard linear layers. Layers i and i+1 are represented by the weight matrices Wi (of shape m x n) and Wi+1 (of shape n x p).
+We want to grow layer i to have q output units (where q > n) and layer i+1 to have q input units. The new weight matrices would be Ui (of shape m x q) and Ui+1 (of shape q x p).
+The first n columns of Wi (rows of Wi+1) are copied as-is into Ui (Ui+1).
+For filling the remaining q - n slots, columns (rows) are sampled randomly from Wi (Wi+1).
+Finally, each replicated row in Ui+1 is scaled by dividing by its replication factor, ensuring that the output of the function remains unchanged by the operation.
+Since convolutions can be seen as multiplication by a double block circulant matrix, the approach can be readily extended for convolutional networks.
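+A numpy sketch of the function-preserving widening for fully connected layers (the random replication and the rescaling follow the description above):
+
+    import numpy as np
+
+    def net2wider(W_i, W_ip1, q, rng=np.random.default_rng(0)):
+        # Widen layer i from n to q output units (q > n), preserving outputs.
+        m, n = W_i.shape
+        idx = np.concatenate([np.arange(n), rng.integers(0, n, q - n)])
+        counts = np.bincount(idx, minlength=n)        # replication factor per unit
+        U_i = W_i[:, idx]                             # copy, then replicate columns
+        U_ip1 = W_ip1[idx, :] / counts[idx][:, None]  # rescale replicated rows
+        return U_i, U_ip1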
+The benefits of using this approach are the following:
+ +The variant discussed above is called the Net2WiderNet variant. There is another variant called Net2DeeperNet that enables the network to grow in depth.
+In that case, a new matrix U, initialized as the identity matrix, is added to the network. Note that, unlike Net2WiderNet, this approach does not work with an arbitrary activation function between the layers (for example, it is valid for ReLU but not for sigmoid).
+The model can accelerate the training of neural networks, especially during development cycle when the designers try out different models.
+The approach could potentially be used in life-long learning systems where the model is trained over a stream of data and needs to grow over time.
+The paper explores knowledge distillation (KD) from the perspective of transferring knowledge between 2 networks of identical capacity.
+This is in contrast to much of the previous work in KD which has focused on transferring knowledge from a larger network to a smaller network.
+The paper reports that these Born Again Networks (BANs) outperform their teachers by significant margins in many cases.
+The paper augments this setting with an extra cross-entropy loss between the output of the teacher and the student networks. The student tried to predict the correct answer while matching the output distribution of the teacher.
+The resulting student network is referred to as BAN - Born Again Network.
+The same approach can be used multiple times (with diminishing returns) where the kth generation student is initialized by knowledge transfer from (k-1)th generation student.
+Hinton et al suggested that even when the output of the teacher network is incorrect, it contains useful information about the similarity between the output classes. This information is referred to as the “dark knowledge”.
+The current paper observed that the gradient of the correct output dimension during distillation and normal supervised training resembles the original gradient up to a weight factor. This sample specific weight is defined by the value of the teacher’s max output.
+This suggests distillation may be performing some kind of importance weighing. To explore this further, the paper considers 2 cases:
+ +Confidence Weighted By Teacher Max (CWTM) - where each example in the student’s loss function is weighted by the confidence that the teacher has on the prediction for that sample. The student incurs a higher loss if the teacher was more confident about the example.
+Dark Knowledge with Permuted Predictions (DKPP) - The non-argmax output of teacher’s predictive distribution are permuted thus destroying the information about which output classes are related.
+The key effect of these variations is that the covariance between the output classes is lost and classical knowledge distillation would not be sufficient to explain improvements (if any).
+Standard Deep Learning networks are not suitable for continual learning setting as the change in the data distribution leads to catastrophic forgetting.
+The paper proposes Memory-based Parameter Adaptation (MbPA), a technique that augments a standard neural network with an episodic memory (containing examples from the previous tasks).
+This episodic memory allows for rapid acquisition of new knowledge (corresponding to the current task) while preserving performance on the previous tasks.
+MbPA consists of 3 components:
+ +f and g are parametric components while M is a non-parametric component.
+M is a dynamically sized dictionary where the key represents the output of the embedding network and the value represents the desired output for a given input (input to the model).
+When a new training tuple (xj, yj) is fed as input to the model, a key-value pair (hj, vj) is added to the memory, where hj = f(xj).
+The memory has a fixed size and acts as a circular buffer. When it gets filled up, earlier examples are dropped.
+When accessing the memory using a key hkey, the k-nearest neighbours (in terms of distance from the given key) are retrieved.
+During testing, the memory is used to adapt the parameters of the output network g while the embedding network f remains the same.
+Given the input x, obtain the embedding corresponding to x and using that as the key, retrieve the k-nearest neighbours from the memory.
+Each retrieved neighbour is a tuple of the form (hk, vk, wk), where wk is proportional to the closeness between the input query and the key corresponding to the retrieved example.
+The collection of all the retrieved examples are referred to as the context C.
+The parameters of the output network g are adapted from θ to θx where θx = θ + δM(x, θ)
+δM(x, θ) is referred to as the contextual update of parameters of the output network.
+MbPA can be interpreted as decreasing the weighted average of negative log likelihood over the retrieved neighbours in the context C.
+The expression corresponding to δM(x, θ) can be obtained by performing gradient descent on the negative log posterior over the context C.
+The posterior expression can be written as a sum of two terms - one corresponding to a weighted likelihood of the data in the context C and the other corresponding to a regularisation term that prevents overfitting to the data.
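+A sketch of the contextual update, with embed and grad_nll standing in for the embedding network f and the gradient of the weighted negative log likelihood (both assumed helpers):
+
+    import numpy as np
+
+    def contextual_update(theta, x, memory, embed, grad_nll, k=5, lr=0.1):
+        # memory: list of (h, v) pairs; retrieve the k nearest neighbours of f(x).
+        h = embed(x)
+        dists = np.array([np.linalg.norm(h - hk) for hk, _ in memory])
+        ctx = np.argsort(dists)[:k]
+        w = 1.0 / (1e-3 + dists[ctx])
+        w = w / w.sum()  # closeness weights over the context C
+        g = sum(wi * grad_nll(theta, *memory[i]) for wi, i in zip(w, ctx))
+        return theta - lr * g  # theta_x = theta + delta_M(x, theta)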
+This idea can be thought of as a generalisation of attention. Attention can be viewed as fitting a constant function over the neighbourhood of memories while MbPA fits a more general function which is parameterised by the output network of the given model. Refer appendix E in the paper for further details.
+MbPA aims to solve the fundamental problem of enabling the model to deal with changes in data distribution.
+In that sense, it is evaluated on a wide range of settings: continual learning, incremental learning, unbalanced datasets and change in data distribution at test time.
+Continual Learning:
+ +In this setting, the model encounters a sequence of tasks and cannot revisit a previous task.
+Permuted MNIST dataset was used.
+The key takeaway is that once a task is catastrophically forgotten, only a few gradient updates on carefully selected data are sufficient to recover the performance.
+Incremental Learning:
+ +In this setting, the model is trained on a subset of classes and then introduced to novel, unseen classes. The model is tested to see if it can incorporate the new knowledge while retaining the knowledge about the previous classes.
+Imagenet dataset with a Resnet V1 model is used. It is first pretrained on 500 classes and then fine-tuned to see how quickly it can adapt to new classes.
+Unbalanced Dataset:
+ +Language Modelling:
+ +MbPA exhibits strong performance on all these tasks showing that the memory-based parameter adaption technique is effective across a range of tasks in supervised learning.
+The paper presents a very interesting approach for learning independent (inverse) data transformation from a set of transformed data points in an unsupervised manner.
+We start with a given data distribution P (say the MNIST dataset) where each x ∈ R^d.
+Consider N transformations M1, …, MN (functions that map an input x to a transformed input x’). Note that N need not be known beforehand.
+These transformations can be thought of as independent (from other transformations) causal mechanisms.
+Applying these transformation would give N new distributions Q1, …, QN.
+These individual distributions are combined to form a single transformed distribution Q which contains the union of samples from the individual distributions.
+At training time, two datasets are created. One dataset corresponds to untransformed objects (sampled from P), referred to as DP. The other dataset corresponds to samples from the transformed distribution Q and is referred to as DQ.
+Note that all the samples in DP and DQ are sampled independently and no supervising information is needed.
+A series of N’ parametric models, called as experts, are initialized and would be trained to learn the different mechanisms.
+For simplicity, assume that N = N’. If N > N’, some experts would learn more than one transformation or certain transformations would not be learnt. If N < N’, some experts would not learn anything or some experts would learn the same distribution. All of these cases can be diagnosed and corrected by changing the number of experts.
+The experts are trained with the goal of maximizing an objective function c: R^d → R, which takes high values on the support of P and low values outside.
+During training, an example xQ (from DQ) is fed to all the experts at the same time. Each expert produces a value cj = c(Ej(xQ))
+The winning expert is the one whose output is the max among all the outputs. Its parameters are updated to maximise its output while the other experts are not updated.
+This forces the best performing model to become even better and hence specialize.
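+A sketch of the competition step (the experts and the discriminator-derived objective c are stand-in callables):
+
+    import numpy as np
+
+    def competition_step(x_q, experts, c):
+        # Every expert transforms the sample; only the highest-scoring
+        # expert receives a gradient update on this sample.
+        outputs = [E(x_q) for E in experts]
+        scores = [c(o) for o in outputs]
+        winner = int(np.argmax(scores))
+        return winner, outputs[winner]  # update experts[winner] to maximise c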
+The objective c comes from adversarial training where a discriminator network discriminates between the untransformed input and the output of the experts.
+Each expert can be thought of as a GAN that conditions on the input xQ (and not on a noise vector). The output of the different experts is fed to the discriminator which provides both a selection mechanism and the gradients for training the experts.
+Experiments are performed on the MNIST dataset using the transformations like translation along 4 directions and along 4 diagonals, contrast shift and inversion.
+The discriminator is further trained against the output of all the losing experts, thereby further strengthening the winning expert.
+The experts are initialized randomly and then pretrained to approximate the identity function by training with identical input-output pairs.
+This ensures that the experts start from a similar level.
+In practice, it seems necessary for the success of the proposed approach.
+During the initial phase, there is a heavy competition between the experts and eventually different winners emerge for different transformations.
+The experiments are quite limited in terms of complexity of dataset and complexity of transformation but it provides evidence for a promising connection between deep learning and causality.
+Appendix mentions that in case there are too many experts, for most of the tasks, only one model specialises and the extra experts do not specialize at all. This is interesting as there is no explicit regularisation penalty which prevents the emergence of multiple experts per task.
+Recurrent Neural Networks have two key issues:
+ +Over parameterization which increases the time for training and inference.
+Ill conditioned recurrent weight matrix which makes training difficult due to vanishing or exploding gradients.
+The paper presents a flexible RNN model called KRU (Kronecker Recurrent Units) which overcomes the above problems by using a Kronecker factored recurrent matrix and soft unitary constraints on its factors.
+Low-rank decomposition.
+Training a neural network on the soft targets predicted by a big pre-trained network.
+Low-bit precision training.
+Hashing.
+Gating mechanism like in LSTMs.
+Gradient Clipping.
+Orthogonal Weight Initialization.
+Parameterizing recurrent weight matrix.
+Uses a Kronecker factored recurrent matrix which enables controlling the number of parameters and number of factor matrices.
+Vanishing and exploding gradients are taken care of by using a soft unitary constraint.
+Why not use strict unitary constraint:
+ +Restricts the search space and makes learning process unstable.
+ +Makes forgetting (irrelevant) information difficult.
+Relaxing the strict constraint has been shown to improve the convergence speed and generalization performance.
+KRU can be easily plugged into RNNs, LSTMs and other variants.
+The recurrent matrix W is parameterized as a Kronecker product of F matrices W0, …, WF-1, where each Wf is a complex matrix of shape Pf x Qf, and the product of all Pf and the product of all Qf are both equal to N.
+Why is W a complex matrix?
+ +In the real space, the set of all unitary (orthogonal) matrices has determinant 1 or -1.
+ +Given that determinant is a continuous function, the unitary set in the real space is disconnected.
+The unitary set in the complex space is connected as its determinants are points on the unit circle.
+A soft unitary constraint is introduced in the form of the regularization term ||Wf^H Wf - I||^2 (per Kronecker factor).
+If each of the Kronecker factors is unitary, the resulting matrix W would also be unitary.
+It is computationally inefficient to apply this constraint over the recurrent matrix W itself as the complexity of the regularizer is given as O(N3).
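+A sketch of the Kronecker-factored recurrent matrix and the per-factor soft unitary penalty (complex factors, as motivated above):
+
+    import numpy as np
+    from functools import reduce
+
+    def kron_recurrent(factors):
+        # W = W_0 (x) ... (x) W_{F-1}; factor shapes (p_f, q_f) with
+        # prod(p_f) = prod(q_f) = N control the parameter count.
+        return reduce(np.kron, factors)
+
+    def soft_unitary_penalty(factors):
+        # Sum over factors of ||W_f^H W_f - I||^2; cheap because each
+        # factor is small, unlike the O(N^3) penalty on the full W.
+        return sum(np.linalg.norm(W.conj().T @ W - np.eye(W.shape[1])) ** 2
+                   for W in factors)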
+The Kronecker recurrent model is compared against the existing recurrent models for multiple tasks including copy memory, adding memory, pixel-by-pixel MNIST, char level language models, polyphonic music modelling, and framewise phoneme classification.
+For most of the tasks, the KRU model produces results comparable to the best performing models despite using fewer parameters.
+Using soft unitary constraints in KRU provides a principled alternative to gradient clipping (a common heuristic to avoid exploding gradients).
+Further, recent theoretical results suggest that gradient descent converges to a global optimizer of linear recurrent networks, even though the learning problem is non-convex, provided that the spectral norm of the recurrent matrix is bounded by 1.
+The key takeaway from the paper is that the state should be high dimensional so that high capacity networks can be used for encoding and decoding the input and output, while the recurrent dynamics should be implemented via a low capacity model.
+The paper presents I2A (Imagination Augmented Agent) that combines the model-based and model-free approaches leading to data efficiency and robustness even with imperfect models.
+I2A agent uses the predictions from a learned environment model as an additional context in deep policy networks. This leads to improved data efficiency and robustness to imperfect models.
+I2A agent has two main modules - Imagination module and the Policy module.
+Imagination Module
+ +Policy Module
+ +Most existing GNN (Graph Neural Network) methods are inherently flat and are unable to process the information in a hierarchical manner.
+The paper proposes a differentiable graph pooling operation, DIFFPOOL, that can generate hierarchical graph representations and can be easily plugged into many GNN architectures.
+CNNs have spatial pooling operation that allows for deep CNN architectures to operate on coarse graph representations of input images.
+This notion cannot be applied as-is to graphs as they do not have a natural notion of spatial locality like images do.
+DIFFPOOL attempts to resolve this problem by learning a differentiable soft-assignment at each layer which is equivalent to pooling the cluster of nodes to obtain a sparse representation.
+Given a graph G(A, F), where A is the adjacency matrix and F is the feature matrix.
+Given a permutation invariant GNN that follows the message passing architecture. The output of this GNN can be expressed as Z = GNN(A, X) where X is the current feature matrix.
+Goal is to stack L GNN layers on top of each other such that the lth layer uses coarsened output from the (l-1)th layer.
+This coarsening operation uses a cluster assignment matrix S.
+The learned cluster assignment matrix at layer l is denoted as Sl.
+Given Sl, the embedding matrix for the (l+1)th layer is given as transpose(Sl)Zl and the adjacency matrix is given by transpose(Sl)AlSl.
+A new GNN, called GNNpool, is used to produce the assignment matrix S by taking a softmax over GNNpool(Al, Xl).
+As long as the GNN model is permutation invariant, the resulting DIFFPOOL model is also permutation invariant.
+The paper uses 2 auxiliary losses to push the model away from spurious local minima early in the training.
+Link prediction objective - at each layer, a link prediction loss, the norm of (A - S transpose(S)), is minimized, with the intuition that nearby nodes should be pooled together.
+Ideally, the cluster assignment for each node should be a one-hot vector so the entropy for cluster assignment per node is regularized.
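+A sketch of one DIFFPOOL layer, with gnn_embed and gnn_pool standing in for the two GNNs:
+
+    import numpy as np
+
+    def diffpool(A, X, gnn_embed, gnn_pool):
+        Z = gnn_embed(A, X)                      # node embeddings
+        logits = gnn_pool(A, X)                  # (n_nodes, n_clusters)
+        e = np.exp(logits - logits.max(axis=1, keepdims=True))
+        S = e / e.sum(axis=1, keepdims=True)     # row-wise softmax assignments
+        X_next = S.T @ Z                         # coarsened features
+        A_next = S.T @ A @ S                     # coarsened adjacency
+        link_loss = np.linalg.norm(A - S @ S.T)  # auxiliary link-prediction loss
+        return A_next, X_next, link_loss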
+DiffPool obtains the highest average performance across all the pooling approaches and improves upon the base GraphSage architecture by an average of around 7%.
+In terms of runtime complexity, the paper reports that DiffPool does not incur any significant additional running time. But given that now there are 2 GNN models per layer, the size of the model should increase.
+DiffPool can capture hierarchical community structure even when trained on just the graph classification loss.
+One advantage of DiffPool is that the nodes are pooled in a non-uniform way so densely connected group of nodes would collapse into one cluster while sparsely connected nodes can retain their identity.
+The paper proposes an approach for using symbolic knowledge in deep learning systems. These constraints are often expressed as boolean constraints on the output of the deep learning system, and directly incorporating them breaks the differentiability of the system.
+The model is given some input data to perform predictions, and symbolic knowledge is provided in the form of boolean constraints, like the exactly-one constraint for one-hot output encoding.
+Most approaches tend to encode the symbolic knowledge in the vector space embedding to keep the model pipeline differentiable. In this process, the precise meaning of symbolic knowledge is often lost.
+A differentiable “semantic loss” is derived which captures the meaning of the constraint while being independent of its syntax.
+A state x (state refers to the instantiation of boolean variables) satisfies a sentence a if a evaluates to true when using the variables as specified by x.
+A sentence a entails another sentence b if all states that satisfy a also satisfy b.
+The output vector of the neural network is denoted as p, where each value in p denotes the probability of an output.
+Three different output constraints are studied:
+ +Exactly-one constraint
+The semantic loss Ls(a, p) is a function of a propositional logic sentence a (the symbolic knowledge constraint) and p (the output of the neural network).
+a is defined over variables (x1, …, xn) and p is interpreted as a vector of probabilities corresponding to these variables xi’s.
+The semantic loss is directly proportional to the negative log likelihood of generating a state that satisfies the constraints when sampling values according to the distribution p.
+Probabilities of variables that are not part of the constraint, do not affect the semantic loss.
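+For the exactly-one constraint, this probability has a simple closed form; a sketch:
+
+    import numpy as np
+
+    def semantic_loss_exactly_one(p):
+        # -log P(exactly one variable is true) under independent
+        # Bernoulli(p_i) draws: sum_i p_i * prod_{j != i} (1 - p_j).
+        p = np.asarray(p, dtype=float)
+        sat = sum(p[i] * np.prod(np.delete(1.0 - p, i)) for i in range(len(p)))
+        return -np.log(sat)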
+Semantic Loss is used in the semi-supervised setting for Permuted MNIST, Fashion MNIST and CIFAR-10.
+The key takeaway is that using semantic loss improves the performance of the state-of-the-art models for Fashion MNIST and CIFAR-10.
+One downside is that the effectiveness of the semantic loss for this type of constraint strongly depends on the performance of the underlying model. Further, the semantic loss does not improve performance in the fully supervised scenario.
+Further experiments are performed to evaluate the performance of the semantic loss on complex constraints. Since these tasks aim to highlight the effect of using semantic loss, only simple models (MLPs) are evaluated.
+The semantic loss is closely related to the automated reasoning task called weighted model counting (WMC).
+Circuit compilation techniques can be used to compute WMC while allowing backpropagation.
+The paper provides a multi-agent learning environment and proposes a learning approach that facilitates the emergence of a basic compositional language.
+The language is quite rudimentary and is essentially a sequence of abstract discrete symbols. But it does comprise a defined vocabulary and syntax.
+Cooperative, partially observable Markov game (multi-agent extension of MDP).
+All agents have identical action and observation spaces, use the same policy and receive a shared reward.
+Physically simulated 2-D environment in continuous space and discrete time with N agents and M landmarks.
+The agents and the landmarks would occupy some location and would have some attributes (colour, shape).
+Within the environment, the agents can go to a location, look at a location or do nothing. Additionally, they can utter communication symbols c (from a shared vocabulary C). Agents themselves learn to assign a meaning to the symbols.
+Each agent has an internal goal (which could require interaction with other agents to complete) which the other agents cannot see.
+Goal for agent i consists of an action to perform, a landmark location where to perform the action and another agent who should be performing the action.
+Since the agent is continuously emitting symbols, a memory module is provided and simple additive memory updates are done.
+For interaction, the agents could use verbal utterances, non-verbal signals (gaze) or non-communicative strategies (pushing other agents).
+A differentiable model of all agent and environment state dynamics over time is created, and the gradient of the return is computed by backpropagating through it.
+Gumbel-Softmax distribution is used to obtain categorical word emission c.
+A multi-layer perceptron is used to model the policy which returns action, communication symbol and the memory update for each agent.
+Since the number of agents (and hence the number of communication streams etc) can vary across instantiations, an identical model is instantiated per agent and per communication stream.
+The output of individual processing modules are pooled into feature vectors corresponding to communication and physical observations. These pooled features and the goal vectors are fed to the final processing module from which actions and categorical symbols are sampled.
+In practice, using an additional task (each agent predicts the goal for another agent) encouraged more meaningful communication utterances.
+Authors recommend using a large vocabulary with a soft penalty that discourages use of too many words. This leads to use of a large vocabulary in the intermediate state which converges to a small vocabulary.
+Along the lines of rich-get-richer dynamics, the communication symbols c are modelled as being generated by a Dirichlet process. The resulting reward across all agents is the log-likelihood of all communication utterances having been generated by a Dirichlet process.
+Since the agents can only communicate in discrete symbols and do not have a global positioning reference, they need to unambiguously communicate landmark references to other agents.
+Non-verbal communication is not possible.
+When trained with just 2 agents, symbols are assigned for each landmark and action.
+As the number of agents is increased, additional symbols are used to refer to agents.
+If the agents of the same colour are asked to perform conflicting tasks, they perform the average of conflicting tasks. If distractor locations are added, the agents learn to ignore them.
+Agents are allowed to observe other agents’ position, gaze etc.
+Now the location can be pointed to using gaze.
+If gaze is disabled, the agent could indicate the goal landmark by moving to it.
+Basically even when the communication is disabled the agents can come up with strategies to complete the task.
+Environment for learning using modalities like vision, audio, semantics, physics and interaction with objects and other agents.
+Humans learn by interacting with their surroundings (environment).
+Similarly, training an agent in an interactive, multi-modal environment (virtual embodiment) could be useful for a learning agent.
+Open-source and Open-AI gym compatible
+Built on top of 45000 3D house layouts from SUNCG dataset.
+Provides both 3D visual and audio recording.
+Semantic image segmentation and language description of objects.
+Rendering Engine
+ +Implemented using Panda 3D game engine.
+Renders RGB+depth scenes based on textures, multi-source lightings and shadows.
+Acoustic Engine
+ +Implemented using EVERT
+Supports multiple microphones, sound sources, sound absorption based on material, atmospheric conditions etc.
+Semantics Engine
+ +Physics Engine
+ +Implemented using Bullet3 Engine
+Supports physical interaction, external forces like gravity and position and velocity information for multiple agents.
+Visual Question Answering
+Conversational Agents
+Training an agent to follow instructions
+Multi-agent communication
+The paper explores “if a well behaved RNN can be replaced by a feed-forward network of comparable size without loss in performance.”
+“Well behaved” is defined in terms of control-theoretic notion of stability. This roughly requires that the gradients do not explode over time.
+The paper shows that under the stability assumption, feedforward networks can approximate RNNs for both training and inference. The results are empirically validated as well.
+Consider a general, non-linear dynamical system given by a differentiable state-transition map Φw. The hidden state evolves as ht = Φw(ht-1, xt).
+Assumptions:
+ +Stable models are the ones where Φ is contractive, i.e., norm(Φw(h, x) - Φw(h’, x)) is at most Λ * norm(h - h’) for some Λ < 1.
+For example, in an RNN, stability would require that norm(W) is less than 1/Lp, where Lp is the Lipschitz constant of the point-wise non-linearity used.
+The feedforward approximation uses a finite context (of length k) and is a truncated model.
+A non-parametric function f maps the output of the recurrent model to prediction. If f is desired to be a parametric model, its parameters can be pushed to the recurrent model.
+For a Λ-contractive system, it can be proved that for a large enough k (and additional Lipschitz assumptions), the difference in prediction between the recurrent and truncated models is negligible.
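+A sketch of the truncated (feedforward) predictor: recompute the state from only the last k inputs, starting from a fixed initial state (phi and f are assumed callables for the transition map and the prediction function):
+
+    def truncated_prediction(phi, f, xs, k, h0):
+        # For a Lambda-contractive phi, this differs from the full
+        # recurrence over all of xs by a term that shrinks like Lambda^k.
+        h = h0
+        for x in xs[-k:]:
+            h = phi(h, x)
+        return f(h)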
+If the recurrent model and truncated feed-forward network are initialized at the same point and trained over the same input for N-step, then for an optimal k, the weights of the two models would be very close in the Euclidean space. It can be shown that this small difference does not lead to large gradient differences during subsequent update steps.
+This can be roughly interpreted as - if the gradient descent can train a stable recurrent network, it can also train a feedforward model and vice-versa.
+The stability condition is important as, without that, truncated models would be bad (even for large values of k). Further, it is difficult to show that gradient descent converges to a stationary point.
+Much of the work in representation learning uses Euclidean vector spaces to embed datapoints (like words, nodes, entities etc).
+This approach is not effective when data has a (latent) hierarchical structure.
+The paper proposes to compute the embeddings in the hyperbolic space so as to preserve both the similarity and structure information.
+Hyperbolic spaces are spaces with a constant negative curvature while Euclidean spaces have zero curvature.
+The hyperbolic disc area and circle length increase exponentially with the radius r, while in Euclidean space they increase only quadratically and linearly respectively.
+This makes the hyperbolic space more suitable for embedding tree-like structures where the number of nodes increases as we move away from the root.
+Hyperbolic spaces can be thought of as the continuous version of trees and trees can be thought of as the discrete version of hyperbolic spaces.
+Poincare model is one of the several possible models of the hyperbolic space and is considered here as it is more amenable to gradient-based optimisation.
+The distance between 2 points changes smoothly and is symmetric. Thus the hierarchical organisation only depends on the distance from the origin, which makes the model applicable in settings where the hierarchical structure needs to be inferred from the data.
+Eventually the norm of a point represents its hierarchy and distance between the points represents similarity.
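+For reference, a numpy sketch of the Poincare distance:
+
+    import numpy as np
+
+    def poincare_distance(u, v):
+        # Blows up as points approach the boundary of the unit ball,
+        # giving exponentially more "room" deeper in the hierarchy.
+        duv = np.dot(u - v, u - v)
+        denom = (1 - np.dot(u, u)) * (1 - np.dot(v, v))
+        return np.arccosh(1 + 2 * duv / denom)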
+Embedding taxonomy for wordnet task
+ +Setup
+ +The input data is a collection of a pair of words (u, v) which are related to each other.
+For each word pair, 10 negative samples of the form (u, v’) are sampled and the training procedure uses a soft ranking loss that aims to bring the related objects closer together.
+Network Embedding
+ +Baselines
+ +Datasets
+ +Lexical Entailment
+Hyperlex - Gold standard to evaluate how well semantic models capture lexical entailment, on a scale of [0, 10].
+The key takeaway is that for all the datasets/setups, hyperbolic embeddings give a performance benefit when the embedding dimension is small.
+Hyperbolic embeddings are not suitable for all datasets, e.g., if the dataset is not tree-like or has cycles.
+Hyperbolic embeddings are difficult to optimize as each operation needs to be modified to be usable in the hyperbolic space.
+BabyAI is a research platform to investigate and support the feasibility of including humans in the loop for grounded language learning.
+The setup is a series of levels (of increasing difficulty) to train the agent to acquire a synthetic language (Baby Language) which is a proper subset of English language.
+BabyAI platform provides support for curriculum learning and interactive learning as part of its human-in-the-loop training setup.
+Curriculum learning is incorporated by having a curriculum of levels of increasing difficulty.
+Interactive learning is supported by including a heuristic expert which can provide new demonstrations on the fly to the learning agent.
+The heuristic expert can be thought of as the human-in-the-loop which can guide the agent through the learning process.
+One downside of human-in-the-loop is the poor sample complexity of the learning agent. The heuristic agent can be used to estimate the sample efficiency.
+BabyAI research platform for grounded language learning with a simulated human-in-the-loop.
+Baseline results for performance and sample efficiency for the different tasks.
+MiniGrid - A partially observable 2D grid-world environment.
+Entities - Agent, ball, box, door, keys
+Actions - pick, drop or move objects, unlock doors etc.
+Synthetic Language (a proper subset of English) - Used to give instructions to the agent
+Support for verifying if the task (and the subtasks) are completed or not
+A level is an instruction-following task.
+Formally, a level is a distribution of missions - a combination of initial state of the environment and an instruction (in Baby Language)
+Motivated by curriculum learning, the authors create a series of tasks (with increasing difficulty).
+A subset of skills (competencies) is required for solving each task. The platform takes into account this constraint when creating a level.
+The platform supports a Heuristic expert that simulates the role of a human teacher and knows how to solve each task.
+For any level, it can suggest actions or generate demonstrations (given the state of the environment).
+An imitation learning baseline is trained for each level.
+Data requirement for each level and the benefits of curriculum learning and imitation learning are investigated (in terms of sample efficiency).
+GRU to encode the sentence, CNN to encode the input observation
+FiLM layer to combine the two representations
+LSTM to encode the per-timestep FiLM encoding (timesteps in the environment)
+Two model variants are considered:
+ +Large Model - Bidirectional GRU + attention + large hidden state
+Small Model - Unidirectional GRU + No attention + small hidden state
+Heuristic expert used to generate trajectory and the models are trained by imitation learning (to be used as baselines)
+The key takeaway is that the current deep learning approaches are extremely sample inefficient when learning a compositional language.
+Data efficiency of RL methods is much worse than that of imitation learning methods showing that the current imitation learning and reinforcement learning methods scale and generalize poorly.
+Curriculum-based pretraining and interactive learning was found to be useful in only some cases.
+The paper demonstrates that Memory Augmented Neural Networks (MANN) are suitable for one-shot learning by introducing a new method for accessing an external memory.
+This method focuses on memory content while earlier methods additionally used memory location based focusing mechanisms.
+Here, MANN refers to neural networks that have an external memory. This includes Neural Turing Machines (NTMs) and excludes LSTMs.
+In meta-learning, a learner is learning at two levels.
+The learner is shown a sequence of tasks D1, D2, …, DT.
+When it is training on one of the datasets (say DT), it learns to solve the current dataset.
+At the same time, the learner tries to incorporate knowledge about how task structure changes across different datasets (second level of learning).
+Following are the desirable characteristics for a scalable, combined architecture:
+ +Memory representation should be both stable and element-wise accessible.
+Number of model parameters should not be tied to the size of the memory.
+In standard learning, the goal is to reduce error on some dataset D. In meta-learning, the goal is to reduce the error across a distribution of datasets p(D).
+Each dataset is presented to the model in the form (x1, null), (x2, y1), …, (xt+1, yt), where yt is the correct label (or value) corresponding to the input xt.
+Further, the data labels are shuffled from dataset to dataset.
+The model must learn to hold the data samples in memory till the appropriate candidate labels are presented in the next step.
+The idea is that a model that meta learns would learn to map data representation to correct labels regardless of the actual context of data representation or the label.
+The paper uses NTM as the MANN with one modification.
+In the original formulation, the memories were addressed by both content and location. Location-based addressing is not optimal for the current setup, where the encoding of information is independent of the sequence.
+A new access module - LRUA - Least Recent Used Access - is used to write to memory.
+LRUA is purely content-based and writes to either least used memory location (to preserve recent information) or most recently used memory location (to overwrite recent information with more relevant information). This is decided on the basis of interpolation between previous read weights and weights scaled according to the usage weight.
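+A sketch of the LRUA write-weight computation (sigmoid-gated interpolation, as described above; alpha is a learned scalar):
+
+    import numpy as np
+
+    def lrua_write_weights(prev_read_w, usage_w, alpha):
+        # Interpolate between the previous read weights (overwrite recent,
+        # relevant info) and a one-hot at the least-used slot (preserve it).
+        gate = 1.0 / (1.0 + np.exp(-alpha))  # sigmoid gate
+        w_lu = np.zeros_like(usage_w)
+        w_lu[np.argmin(usage_w)] = 1.0       # least recently used location
+        return gate * prev_read_w + (1.0 - gate) * w_lu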
+Omniglot (classification)
+Sampled functions from Gaussian Processes
+For the omniglot dataset, the model was trained with various combinations of randomly chosen classes with randomly chosen labels.
+As baselines, the following models were considered:
+ +Since each episode (dataset created by the combination of classes) contains unique classes (with their own unique labels) it is important to clear the memory across different episodes.
+For the regression task, the data was generated from a GP prior with a fixed set of hyper-parameters which resulted in different functions.
+For both tasks, the MANN architecture outperforms the LSTM baseline and vanilla NTMs.
+The paper introduces a learned gradient descent optimizer that has low memory and computational overhead and that generalizes well to new tasks.
+Uses a hierarchical RNN architecture augmented by features like adapted input and output scaling, momentum etc.
+A meta-training set of small optimization tasks with diverse loss landscapes is developed. The learned optimizer generalizes to much more complex tasks and setups.
+A hierarchical RNN is designed to act as a learned optimizer. This RNN is the meta-learner and its parameters are shared across different tasks.
+The learned optimizer takes as input the gradient (and related metadata) for each parameter and outputs the update to the parameters.
+At the lowest level of the hierarchy, a small “Parameter RNN” ingests the gradient (and related metadata) for a single parameter.
+One level up, an intermediate “Tensor RNN” incorporates information from a subset of Parameter RNNs (e.g., one Tensor RNN per layer of a feedforward network).
+At the highest level is the global RNN, which receives input from all the Tensor RNNs and can keep track of weight updates across the task.
+The output of each RNN is averaged and fed as input to the next higher-level RNN, and the output of each higher-level RNN is fed as a bias to the RNNs below it.
+In practice, the hidden states are fixed at 10, 30 and 20 respectively.
+Attention and Nesterov’s momentum
+ +Attention mechanism is incorporated by attending to new regions of the loss surface (which are an offset from previous parameter location).
+To incorporate momentum on multiple timescales, the exponential moving average of the gradient at several timescales is also provided as input.
+The average gradients are rescaled (as in RMSProp and Adam)
+Relative log gradient magnitudes are also provided as input so that the optimizer can access how the gradient magnitude changes with time.
+The paper describes a combinatorial approach to embed trees into hyperbolic spaces without performing optimization.
+The resulting mechanism is analyzed to obtain dimensionality-precision tradeoffs.
+To embed any metric spaces in the hyperbolic spaces, a hyperbolic generalization of the multidimensional scaling (h-MDS) is proposed.
+Hyperbolic Spaces
+ +Have the “tree-like” property, i.e., the shortest path between a pair of points is almost the same as the path through the origin.
+Generally, the Poincare ball model is used given advantages like being conformal to Euclidean space.
+Fidelity Measures
+ +Mean Average Precision - MAP
+ +Distortion
+ +Embed the given graph G = (V, E) into a tree T.
+Embed the tree T into the poincare ball Hd of dimensionality d.
+Consider two points a and b (from the tree) where b is the parent of a.
+Assume that a is embedded as f(a) and b is embedded as f(b), and that the children of a need to be embedded.
+Reflect f(a) and f(b) across a geodesic such that f(a) is mapped to 0 (origin) while f(b) is mapped to some new point z.
+Children of a are placed at points yi which are equally spaced around a circle of radius (e^r - 1) / (e^r + 1) and maximally separated from z, where r is the scaling factor.
+Then all the points are reflected back across the geodesic so that all children are at a distance r from f(a).
+To embed the tree itself, place the root node at the origin, place its children around it in a circle, then place their children and so on.
+In this construct, precision scales logarithmically with the degree of the tree but linearly with the maximum path length.
+In the d-dimensional space, the points are embedded into hyperspheres (instead of circles).
+The number of child nodes that can be placed for a particular angle grows with the dimension.
+Increasing dimension helps with bushy trees (with high node degree).
+Given the pairwise distance from a set of points in the hyperbolic space, how to recover the points?
+The corresponding problem in the Euclidean space is solved using MDS.
+A variant of MDS called as h-MDS is proposed.
+MDS makes a centering assumption that the points have 0 mean. In h-MDS, a new mean (called the pseudo-Euclidean mean) is introduced to enable recovery via matrix factorization.
+Instead of the Poincare model, the hyperboloid model is used (though the points can be mapped back and forth).
+Given the pairwise distances, a new matrix Y is constructed by applying cosh on the pairwise distances.
+Running PCA on -Y recovers X up to rotation.
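+A rough sketch of this recovery step (an eigendecomposition is used here in place of a full PCA pipeline; this is illustrative, not the paper's exact procedure):
+
+    import numpy as np
+
+    def h_mds(D, d):
+        # D: pairwise hyperbolic distances; recover a rank-d embedding
+        # (up to rotation) from the top eigenvectors of -cosh(D).
+        Y = np.cosh(D)
+        vals, vecs = np.linalg.eigh(-Y)
+        top = np.argsort(vals)[::-1][:d]
+        return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))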
+PGA (Principal Geodesic Analysis) is the counterpart of PCA in hyperbolic spaces.
+First the Karcher mean of the given points is computed.
+All points xi are reflected so that their mean is 0 in the Poincare disk model.
+Combining that with Euclidean reflection formula and hyperbolic metrics leads to a non-convex loss function which can be optimized using gradient descent algorithm.
+Datasets
+ +Results
+Hindsight Experience Replay (HER) is a sample-efficient technique for learning from sparse rewards.
+Assume a footballer misses the goal narrowly. Even though the player does not get any “reward” (in terms of a goal), the player realizes that had the goal post been shifted a bit, the shot would have resulted in a goal (reward).
+The same intuition is applied to the RL agent - let us say that the true goal state was g while the agent ends up in the state s.
+While the action sequence is not useful for reaching the goal state g, it is indeed useful for reaching the state s. Hence the trajectory can be replayed with the goal set to s (and not g).
+Multi-goal policy trained using Universal Value Function Approximation (UVFA).
+Every episode starts by sampling a start state and a goal state. Each goal has a different reward function.
+Policy uses both the current state and the current goal state and leads to a state transition sequence s1, s2,…, sn.
+Each of these transitions si -> si+1 is stored in a buffer with both the original goal and a subset of other goals.
+For the goal selection, following strategies are tried:
+ +Future - the goal state is a state observed k steps after the state transition (in the same episode).
+Final - the goal state is the final state of the current episode.
+Episode - k random states are selected from the current episode.
+Random - k states are selected randomly.
+Any off-policy algorithm can be used. Specifically, DDPG is used.
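+A simplified sketch of the relabeling step with the “final” strategy (it assumes transitions expose state/action/next_state/goal fields and a sparse reward_fn(next_state, goal); both are illustrative, not from the paper):
+```python
+def her_relabel(episode, reward_fn):
+    """Store each transition twice: once with the original goal and once with
+    the goal relabeled to the state actually achieved at the end of the episode."""
+    achieved = episode[-1].next_state  # the "final" strategy
+    out = []
+    for t in episode:
+        out.append((t.state, t.action, t.next_state, t.goal,
+                    reward_fn(t.next_state, t.goal)))
+        out.append((t.state, t.action, t.next_state, achieved,
+                    reward_fn(t.next_state, achieved)))
+    return out
+```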
+A robotic arm is simulated using MuJoCo for push, slide, and pick-and-place tasks.
+DDPG with and without HER evaluated on the 3 tasks.
+DDPG with the HER variant significantly outperforms the baseline in all the cases.
+For top-k classification tasks, cross entropy is widely used as the learning objective even though it is the optimal metric only in the limit of infinite data.
+The paper introduces a family of smoothed loss functions that are specially designed for top-k optimization.
+Here s denotes the output of the classifier model to be learnt, y is the ground-truth label, s[k] denotes the kth largest element of s, and s\p denotes the vector s without its pth element.
+This lk loss has two limitations:
+ +It is continuous but not differentiable in s.
+Its weak derivatives have at most 2 non-zero elements.
+The loss can be reformulated by adding and subtracting the k-1 largest scores of s\y and sy and by introducing a temperature parameter τ.
+For any τ > 0, Lkτ is infinitely differentiable and has non-sparse gradients.
+Under mild conditions, Lkτ approaches lk (in a pointwise sense) as τ approaches 0 from above.
+It is an upper bound on the actual loss (up to a constant factor).
+It is a generalization of the cross-entropy loss for different values of k, and τ and higher margins.
+C(n, k) terms need to be evaluated for computing the loss for one sample (n is the number of classes).
+Loss Lkτ can be expressed in terms of elementary symmetric polynomials σi(e) (sum of all products of i distinct elements of vector e). Thus the challenge is to compute σk efficiently.
+Compute σk(e) where e is an n-dimensional vector, k ≪ n, and e[i] ≠ 0 for all i.
+σi(e) can be computed from the coefficients of the polynomial (X+e1)(X+e2)…(X+en), obtained via a divide-and-conquer approach with polynomial multiplication.
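+A small sketch of this divide-and-conquer computation (without the paper's GPU parallelization and log-space tricks):
+```python
+import numpy as np
+
+def elementary_symmetric(e):
+    """Return [1, sigma_1(e), ..., sigma_n(e)] as the coefficients of the
+    polynomial (X + e_1)(X + e_2)...(X + e_n)."""
+    def product(lo, hi):
+        if hi - lo == 1:
+            return np.array([1.0, e[lo]])        # the polynomial X + e_lo
+        mid = (lo + hi) // 2
+        return np.polymul(product(lo, mid), product(mid, hi))
+    return product(0, len(e))
+```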
+With some more optimizations (eg log(n) levels of recursion, with each level parallelized on a GPU), the resulting algorithm scales well with n on a GPU.
+Operations are performed in the log-space using the log-sum-exp trick to achieve numerical stability in single floating point precision.
+The backward pass uses optimizations like computing derivative of σj with respect to ei in a recursive manner.
+Appendix of the paper describes these techniques in detail.
+Experiments are performed on CIFAR-100 (with noise) and Imagenet.
+For CIFAR-100 with noise, the labels are randomized with probability p (within the same top-level class).
+The proposed loss function is much more robust to both noise and reductions in the amount of training data, as compared to the cross-entropy loss, for both top-k and top-1 performance.
+The paper proposes a pretraining technique that can be used with the GNN architecture for learning graph representation as induced by powerful graph kernels.
+Graph Kernel methods can learn powerful representations of the input graphs but the learned representation is implicit as the kernel function actually computes the dot product between the representations.
+GNNs are flexible and powerful in terms of the representations they can learn, but they can easily overfit if a large amount of training data is not available, as is commonly the case with graphs.
+Kernel methods can be used to learn an unsupervised graph representation that can be finetuned using the GNN architectures for the supervised tasks.
+Given a dataset of graphs g1, g2, …, gn, use a relevant kernel function to compute k(gi, gj) for all pairs of graphs.
+A siamese network is used to encode the pair of graphs into representations f(gi) and f(gj) such that dot(f(gi), f(gj)) equals k(gi, gj).
+The function f is trained to learn the compressed representation of kernel’s feature space.
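+A rough sketch of this pretraining objective (assuming a graph encoder f and precomputed kernel values; all names are illustrative):
+```python
+import torch
+
+def kernel_mimic_loss(f, g_i, g_j, k_ij):
+    """Train f so that the dot product of the two graph embeddings
+    matches the precomputed kernel value k(g_i, g_j)."""
+    pred = torch.dot(f(g_i), f(g_j))
+    return (pred - k_ij) ** 2
+```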
+Pretraining uses the WL (Weisfeiler-Lehman) kernel.
+The pretrained model performs better than the baselines for 2 datasets but lags behind the WL method (which was used for pretraining) on the NCI1 dataset.
+A new (and more realistic) evaluation protocol for lifelong learning where each data point is observed just once and disjoint sets of tasks are used for training and validation.
+A new metric that focuses on the efficiency of the models - in terms of sample complexity and computational (and memory) costs.
+A modification of Gradient Episodic Memory (GEM) which reduces the computational overhead of GEM without compromising the results.
+Empirical validation that using task descriptors help lifelong learning models and improve their few-shot learning capabilities.
+Two groups of datasets - one for training and evaluation (DEV) and the other for cross-validation (DCV).
+Data can be sampled multiple times from the cross-validation dataset but only once from the training dataset.
+Each group of dataset (say DEV or DCV) is a list of task-specific datasets Dk (k is the task index).
+Each sample in Dk is of the form (x, t, y) where x is the data, t is the task descriptor and y is the output.
+Dk contains Bk minibatches of data.
+a(k, i, j) = accuracy on test task j after training on the ith minibatch of training task k.
+Ak = mean over j = 1 to k of a(k, Bk, j), ie train the model on all the data for task k and then test it on all the tasks seen so far.
+f(j, k) = forgetting on task j after training on all minibatches up to task k.
+f(j, k) = max over l = 1 to k-1 of (a(l, Bl, j) - a(k, Bk, j)).
+Forgetting: Fk = mean over j = 1 to k-1 of f(j, k).
+Zb = average b-shot performance, where b is the minibatch number.
+Zb = mean over k = 1 to T of a(k, b, k).
+LCAβ = mean over b = 0 to β of Zb.
+One special case is LCA0, which is the zero-shot (forward transfer) performance, ie the performance on an unseen task.
+In experiments, β is kept small as we want the model to learn from few examples.
+GEM has been shown to be very effective in single epoch setting but introduces a very high computational overhead.
+Average GEM (AGEM) reduces this overhead by sampling (and using) only some examples from the episodic memory instead of using all the examples.
+While GEM provides better guarantees in terms of worst-case forgetting, AGEM provides better guarantees in terms of average accuracy.
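+A minimal sketch of the AGEM projection step (assuming g is the flattened gradient on the current task's batch and g_ref the gradient on a batch sampled from the episodic memory):
+```python
+import torch
+
+def agem_project(g, g_ref):
+    dot = torch.dot(g, g_ref)
+    if dot >= 0:
+        return g  # no interference with the memory, keep the gradient as-is
+    # remove the component that would increase the average memory loss
+    return g - (dot / torch.dot(g_ref, g_ref)) * g_ref
+```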
+Compositional Task Descriptors are used to speed training on the subsequent tasks.
+A matrix specifying the attribute values of the objects (to be recognized in the task) is used.
+A joint-embedding space between image features and attribute embeddings is learned.
+Integer task descriptors for MNIST and CIFAR and class attributes as descriptors for CUB and AWA
+Baselines include GEM, iCaRL, Elastic Weight Consolidation, Progressive Neural Networks etc.
+AGEM outperforms the other models on all the datasets except MNIST, where Progressive Neural Networks lead. One reason could be that MNIST has a large number of training examples per task. However, Progressive Neural Networks lead to poor utilization of capacity.
+While AGEM and GEM have similar performance, GEM has a much higher computational and memory overhead.
+Use of task descriptors improves the accuracy for all the models.
+It seems that AGEM offers a good tradeoff between average accuracy performance and efficiency - in terms of sample efficiency, memory requirements and computational costs.
+The paper proposes a simple and robust approach for hierarchically training an agent in the sparse reward setup.
+The broad idea is to train low-level primitives that are sufficiently diverse (so that they can be composed for solving higher level tasks) and to train a high level primitive that learns to combine these primitives for any given downstream task.
+Low-level policies should be:
+ +For the low-level policy, the per-timestep reward is directly proportional to the change in the external state. The same reward is used for all the agents and environments (except regulated with environment-specific controls and survival rewards).
+Good movement policies are expected to be at least roughly periodic and phase input (or time index) is used to achieve periodicity.
+Phase-conditioned policy π = f(sp, φ), where φ ∈ {0, 1, …, k-1} is the phase index.
+At each timestep t, the model receives the observation sp and the phase index φ = t % k. The phase index is represented by a vector bφ.
+For phase conditioned policies, the agent state and actions are encouraged to be cyclic with the help of a cyclic loss.
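+A small sketch of how the phase input could be formed (assuming bφ comes from a learned embedding table, which is an assumption of this sketch):
+```python
+import numpy as np
+
+def phase_conditioned_input(s_p, t, k, phase_embeddings):
+    phi = t % k                       # phase index at timestep t
+    b_phi = phase_embeddings[phi]     # vector representation b_phi of the phase
+    return np.concatenate([s_p, b_phi])  # fed to the low-level policy
+```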
+Environments: Ant and Humanoid from Mujoco.
+Low-level control:
+ +High-level control:
+ +Cross Maze Environment with fixed goals
+ +3 goals along 3 paths
+Proposed method converges faster and to a smaller final distance to the goal showing that it is both efficient and consistent (with smaller variance across random seeds).
+Random Goal Maze
+ +The goal is randomly drawn from a set of goals.
+“Cross” (shaped) maze and “skull” (shaped) mazes are considered.
+Even with velocity rewards and pretraining on low-level objectives (which can be thought of as exploration bonuses), the baseline fails to get close to the goal locations, while the proposed model reaches the goal most of the time.
+The main results are reported using PPO, though repeating the experiments with A2C and DQN shows that the idea is fairly robust.
+The paper reports that, in their experiments, finetuning the lower-level primitives did not help much, though this might not be the case for other environments.
+The paper proposes an approach for learning neural networks (modules) that can be combined in different ways to solve different tasks (combinatorial generalization).
+The proposed model is called BOUNCEGRAD.
+Focuses on supervised learning.
+Task distribution p(T).
+Each task is a joint distribution pT(x, y) over (x, y) data pairs.
+Given data from m meta-training tasks, and a meta-test task, find a hypothesis h which performs well on the unseen data drawn from the meta-test task.
+Given a compositional scheme C, a set of modules F1, …, Fk (represented as a whole by F) and the set of their respective parameters θ1, …, θk (represented as a whole by θ), (C, F, θ) represents the set of possible functional input-output mappings. These mappings form the hypothesis space.
+A structured hypothesis model is specified by what modules to use and their parametric forms (but not the values).
+Choosing a single module for the task at hand.
+Fixed compositional structure but different modules selected every time.
+Weight ensemble (maybe using attention mechanism)
+General function composition tree
+Offline Meta Learning Phase:
+ +Take training and validation dataset for the first k tasks and generate a parameterization for each module θ1, …, θk.
+The hypothesis (or composition) to use comes from the online meta-test learning phase.
+In this stage, find the best θ given a structure.
+Online Meta-test Learning Phase
+ +Given a hypothesis space and θ, the output is a compositional form (or hypothesis) that specifies how to compose the modules.
+In this stage, find the best structure, given a hypothesis space and θ.
+During Meta-test learning phase, simulated annealing is used to find the optimal structure, with temperature T decreased over time.
+During the meta-learning phase, the actual objective function is replaced by a surrogate, smooth objective function (during the search step) to avoid local minima.
+Once a structure has been picked, any gradient descent based approach can be used to optimize the modules.
+Basically, the state of the optimization process comprises the parameters and the temperature. Together, they are used to induce a distribution over the structures. Given a structure, θ is optimized and T is annealed over time.
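+A high-level sketch of this search loop (propose and loss are placeholders for a neighborhood proposal over structures and the meta-test training error):
+```python
+import math
+import random
+
+def anneal_structure(structure, loss, propose, T0=1.0, decay=0.99, steps=1000):
+    T, current_loss = T0, loss(structure)
+    for _ in range(steps):
+        candidate = propose(structure)
+        delta = loss(candidate) - current_loss
+        # accept improvements always; accept worse structures with prob e^(-delta/T)
+        if delta < 0 or random.random() < math.exp(-delta / T):
+            structure, current_loss = candidate, current_loss + delta
+        T *= decay  # anneal the temperature over time
+    return structure
+```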
+The learning procedure can be improved by performing parameter tuning during the online (meta-test learning) phase as well. The resulting approach is referred to as MOMA - MOdular MAml.
+Pooled - Single network using combined data of all the tasks.
+MAML - Single network using MAML
+BOUNCEGRAD - Modular Network without MAML adaptation in online learning.
+MOMA - BOUNCEGRAD with MAML adaptation in online learning.
+Sine-function prediction problem
+In general, MOMA outperforms other models.
+With a small amount of online training data, BOUNCEGRAD outperforms other models as it has a better structural prior.
+11 different objects (with different shapes) on 4 surfaces with different friction properties.
+2 meta-learning scenarios are considered. In the first case, the object-surface combination in the test case was present in some meta-training tasks and in the other case, it was not present.
+For previously seen combinations, MOMA performs the best followed by BOUNCEGRAD and MAML.
+For unseen combinations, all the 3 are equally good.
+Compositional scheme is the attention mechanism.
+An interesting result is that the modules seem to specialize (and activate more often) based on the shape of the object.
+Composition structure - generating kinematic subtrees for each body part (2 legs, 2 arms, 2 torsos).
+Again 2 setups are used - one where all activities in the training and the meta-test task are shared while the other setup where the activities are not shared.
+For known activities, MOMA and BOUNCEGRAD perform the best, while for unknown activities, MOMA performs the best.
+While the approach is interesting, a more suitable set of tasks (from the point of view of composition) would be more convincing.
+It would be useful to see the computational tradeoff between MAML, BOUNCEGRAD, and MOMA.
+The paper proposes an approach to learn useful skills without a reward function by maximizing an information theoretic objective by using a maximum entropy policy.
+Skills are defined as latent-conditioned policies that alter the state of the environment in a consistent way.
+Skills should dictate the states that the agent visits. Different skills should visit different states to be distinguishable.
+States (not actions) should be used to distinguish between skills as not all actions change the state (for the outside observer).
+Skills are encouraged to be diverse and “exploratory” by learning skills that act randomly (have high entropy).
+(S, A) - state and action
+z ~ p(z) - latent variable to condition the policy.
+Skill - policy conditioned on a fixed z.
+The objective is to maximize the mutual information between the skill and the state (MI(S; Z)), ie the skill should control which states are visited, or equivalently, the skill should be inferable from the states visited.
+Simultaneously minimize the mutual information between skills and actions given the state to ensure that the state (and not the action) is used to distinguish the skills.
+Maximize the entropy of the mixture of policies (p(z) and all the skills).
+Policy π(a | s, z)
+The task reward is replaced by the pseudo-reward log qφ(z | s) - log p(z).
+During unsupervised training, z is sampled at the start of the episode and then not changed during the episode.
+Learning agent gets rewards for visiting the states that are easy to discriminate while the discriminator updated to correctly predict z from the states visited.
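+A condensed sketch of the pseudo-reward computation (assuming a discriminator network that outputs logits over skills for a single state, and a uniform prior p(z)):
+```python
+import torch
+import torch.nn.functional as F
+
+def diayn_reward(discriminator, state, z, num_skills):
+    log_q = F.log_softmax(discriminator(state), dim=-1)      # log q_phi(. | s)
+    log_p_z = -torch.log(torch.tensor(float(num_skills)))    # log p(z), uniform prior
+    return log_q[z] - log_p_z
+```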
+The agent learns a diverse set of primitive behaviors for all tasks ranging from 2 DoF to 111 DoF.
+For the inverted pendulum and mountain car tasks, the skills become increasingly diverse throughout training.
+Use of uniform prior, in place of a learned prior, for p(z) allows for discovery of more diverse skills.
+The proposed approach can be used as a pretraining technique where the best-performing primitives (from unsupervised training) can be finetuned with the task-specific rewards.
+The discovered skills can be used for hierarchical RL by learning a meta-policy (which chooses the skill to execute for k steps).
+Modifying the discriminator in the proposed formulation can be used to bias DIAYN towards discovering a particular type of policies. This provides a mechanism for incorporating “supervision” in the learning setup.
+The “discovered” primitives can also be used for imitation learning.
+Training RNNs to model long term dependencies is difficult but in some cases, the information about dependencies between elements (of the sequence) may be present in the form of symbolic knowledge.
+For example, when encoding sentences, coreference, and hypernymy relations can be extracted between tokens.
+These elements (tokens) can be connected with each other by different kinds of edges, resulting in a graph data structure.
+One approach could be to model this knowledge (encoded in the graph) using a graph neural network (GNN).
+The authors prefer to encode the information into 2 DAGs (via topological sorting) as training a GNN could add some extra overhead.
+This results in the Memory as Acyclic Graph Encoding RNN (MAGE-RNN) architecture. Its GRU version is referred to as MAGE-GRU.
+Given an input sequence of tokens [x1, x2, …, xT] and information about which tokens relate to other tokens, a graph G is constructed with different (possibly typed) edges.
+Given the graph G, two DFS orderings are computed - forward DFS and backward DFS.
+MAGE-RNN uses separate networks for accessing the forward and backward DFS orders.
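+A minimal sketch of obtaining the two orders (a plain DFS-based topological sort over an adjacency-list DAG; this is my reading of the setup, not the paper's code):
+```python
+def topo_order(adj, num_nodes):
+    order, seen = [], set()
+    def dfs(u):
+        seen.add(u)
+        for v in adj[u]:
+            if v not in seen:
+                dfs(v)
+        order.append(u)  # post-order; reversed below for a topological order
+    for u in range(num_nodes):
+        if u not in seen:
+            dfs(u)
+    return order[::-1]
+
+# forward order on G; backward order on G with all of its edges reversed
+```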
+A separate hidden state is maintained for each entity type to separate memory content from addressing.
+For any DFS order (forward or backward), the representation at time t is given as the concatenation of representation of different edge types at that time.
+The hidden states (for different edge types at time t) are updated in the topological order using the current state of all incoming edges at xt.
+The representation of the DFS order is given as the sequence of all the previous representations.
+In some cases, elements across multiple sequences may be related to each other. In that case, the graph is decomposed into a collection of DAGs, and MAGE-GRU is applied to the DAGs by taking one random permutation of the sequences and decomposing it into the forward and the backward graphs.
+The model is evaluated on the task of text comprehension with coreference on bAbi dataset (story based QA), LAMBADA dataset (broad context language modeling) and CNN dataset (cloze-style QA).
+MAGE-GRU was used as a replacement for GRU units in bi-directional GRUs and GA-Reader architecture.
+DAG-RNN and shared version of MAGE-GRU (with shared edge types) are the other baselines.
+For all the cases, the model with MAGE-GRU works the best.
+TuckER is a simple, yet powerful linear model that uses Tucker decomposition for the task of link prediction in knowledge graphs.
+Let E be the set of all the entities and R be the set of all the relations in a given knowledge graph (KG).
+The KG can be represented as a list of triples of the form (source entity, relation, object entity) or (es, r, eo).
+The list of triples can be represented as a third-order tensor (of binary values) where each element corresponds to a triple and each element’s value indicates whether that triple is present in the KG or not.
+The link prediction task can be formulated as - given a set of all triples, learn a scoring function that assigns a score to each triple. The score indicates whether the triple is actually present in the KG or not.
+Tucker decomposition factorizes a tensor into a set of factor matrices and a smaller core tensor.
+In the specific case of three-mode tensors (an alternate representation of a KG), the given original tensor X (of shape IxJxK) can be factorized into a core tensor W (of shape PxQxR) and 3 factor matrices - A (of shape IxP), B (of shape JxQ) and C (of shape KxR) - such that X is approximately W x1 A x2 B x3 C, where xn denotes the tensor product along the nth mode.
+Generally, P, Q, R are smaller than I, J, K (respectively), and W can be seen as a compressed version of X.
+Two embedding matrices are used, for embedding the entities and the relations respectively.
+The entity embedding matrix E is shared for both the subject and the object, ie E = A = B.
+The scoring function is given as W x1 es x2 wr x3 eo, where es, wr and eo are the embedding vectors corresponding to the subject entity, the relation, and the object entity respectively. Note that both the core tensor and the factor matrices are learned.
+The model is trained with the standard negative log-likelihood loss, given as (for one triple): -(y * log(p) + (1-y) * log(1-p)).
+To speed up training and increase accuracy, 1-N scoring is used. A given (es, r) is simultaneously scored for all the entities using the local-closed world assumption (knowledge graph is only locally complete).
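+A small sketch of the scoring function (shapes are illustrative: a core tensor W of shape (de, dr, de), embedding vectors of matching sizes, and an entity embedding matrix E):
+```python
+import numpy as np
+
+def tucker_score(W, e_s, w_r, e_o):
+    # W x1 e_s x2 w_r x3 e_o collapses the core tensor to a scalar score
+    return np.einsum('pqr,p,q,r->', W, e_s, w_r, e_o)
+
+def tucker_score_1_to_N(W, e_s, w_r, E):
+    # 1-N scoring: score (e_s, r) against all entities in E at once
+    return np.einsum('pqr,p,q,nr->n', W, e_s, w_r, E)
+```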
+Handling asymmetric relations is straightforward by learning a relation embedding alongside a relation-agnostic core tensor which enables knowledge sharing across relations.
+One important consideration would be the expressive power of TuckER models, especially in relation to other models like ComplEx and SimplE.
+It can be shown that TuckER is fully expressive, ie given any ground truth over E and R, there exists a TuckER model which can perfectly represent the data - using one-hot entity and relation embeddings.
+For full expressiveness, the required dimensionality of entities (relations) is nE (nR), where nE (nR) is the number of entities (relations). In comparison, the required dimensionality for ComplEx is nE * nR (for both entities and relations) and for SimplE, it is min(nE * nR, number of facts + 1) (for both entities and relations).
+Many existing models like RESCAL, DistMult, ComplEx, SimplE etc can be seen as special cases of TuckER.
+FB15k, FB15k-237, WN18, WN18RR
+The max number of entities is around 41K and max number of relations is around 1.3K
+Mean Reciprocal Rank (MRR) - the average of the inverse of the rank assigned to the true triple, over all ne generated triples.
+hits@k (k = 1, 3, 10) - percentage of times the true triple is ranked in the top k of the ne generated triples.
+Higher is better for both the metrics.
+TuckER outperforms all the baseline models on all but one task.
+Dropout is an important factor, with higher dropout rates (0.3, 0.4, 0.5) needed for datasets with fewer training examples per relation (hence more prone to overfitting).
+TuckER improves performance more significantly when the number of relations is large.
+Even with lower embedding dimensions, TuckER’s performance does not deteriorate as much as other models.
+The paper presents a framework that uses diverse suboptimal world models that can be used to break complex policies into simpler and modular sub-policies.
+Given a task, both the sub-policies and the controller are simultaneously learned in a bottom-up manner.
+The framework is called Model Primitive Hierarchical Reinforcement Learning (MPHRL).
+Instead of learning a single transition model of the environment (aka world model) that can model the transitions very well, it is sufficient to learn several (say k) suboptimal models (aka model primitives).
+Each model primitive will be good in only a small part of the state space (aka region of specialization).
+These model primitives can then be used to train a gating mechanism for selecting sub-policies to solve a given task.
+Since these model primitives are sub-optimal, they are not directly used with model-based RL but are used to obtain useful functional decompositions and sub-policies are trained with model-free approaches.
+A gating controller is trained to choose the sub-policy whose model primitive makes the best prediction.
+This requires modeling p(Mk | st, at, st+1), where p(Mk) denotes the probability of selecting the kth model primitive. This is hard to compute as the system does not have access to st+1 and at at time t, before it has chosen the sub-policy.
+Properly marginalizing over st+1 and at would require expensive MC sampling. Hence an approximation is used, and the gating controller is modeled as a categorical distribution producing p(Mk | st). It is trained via a conditional cross-entropy loss where the ground-truth distribution is obtained from transitions sampled in a rollout.
+The paper notes that this technique is biased but reports that it still works for the downstream tasks.
+The gating controller composes the sub-policies as a mixture of Gaussians.
+For learning, the PPO algorithm is used, with each module’s gradient weighted by the probability assigned by the gating controller.
+Domains:
+ +Mujoco ant navigating different mazes.
+Stacker arm picking up and placing different boxes.
+Implementation Details:
+ +Gaussian subpolicies
+PPO as the baseline
+Model primitives are hand-crafted using the true next state provided by the environment simulator.
+Single Task
+Only the maze task is considered, with the start position (of the ant) and the goal position fixed.
+Observation includes distance from the goal.
+Forcing the agent to decompose the problem, when a more direct solution may be available, causes the sample complexity to increase on one task.
+Lifelong Learning
+ +Maze
+ +10 random Mujoco ant mazes used as the task distribution.
+MPHRL takes almost twice the number of steps (as compared to the PPO baseline) to solve the first task, but this cost gets amortized over the distribution, and the model takes half the number of steps as compared to the baseline (summed over the 10 tasks).
+Pick and Place
+ +8 Pick and Place tasks are created with max 3 goal locations.
+Observation includes the position of the goal.
+Ablations
+ +Overlapping model primitives can degrade the performance (to some extent). Similarly, the performance suffers when redundant primitives are introduced indicating that the gating mechanism is not very robust.
+Sub-policies could quickly adapt to the previous tasks (on which they were trained initially) despite being finetuned on subsequent tasks.
+The order of tasks (in the 10-Maze task) does not degrade the performance.
+Transferring the gating controller leads to negative transfer.
+Notes
+ +The paper provides useful empirical advice for adapting pretrained language models for a given target task.
+Pre-trained models considered
+ +ELMo
+BERT
+Tasks considered
+ +Named Entity Recognition (NER) - CoNLL 2003 dataset
+Sentiment Analysis (SA) - Stanford Sentiment Treebank (SST-2) dataset
+Natural Language Inference (NLI) - MultiNLI and Sentences Involving Compositional Knowledge (SICK-E) dataset
+Paraphrase Detection (PD) - Microsoft Research Paraphrase Corpus (MRPC)
+Semantic Textual Similarity (STS) - Semantic Textual Similarity Benchmark (STS-B) and SICK-R
+The last 3 tasks (NLI, PD, STS) are defined for sentence pairs.
+Adaptation Strategies
+ +Feature Extraction
+ +The pretrained model is only used for extracting features and its weights are kept fixed.
+For both ELMo and BERT, the contextual representation of the words from all the layers are extracted.
+A weighted combination of these layers is used as an input to the task-specific model.
+Task-specific models
+ +NER - BiLSTM with CRF layer
+SA - bi-attentive classification network
+NLI, PD, STS - Enhanced Sequential Inference Model (ESIM)
+Fine-tuning
+ +The pretrained model is finetuned on the target task.
+Task-specific models for ELMO
+ +NER - CRF on top of LSTM states
+SA - Max-pool over the language model states followed by a softmax layer
+NLI, PD, STS - cross sentence bi-attention between the language model states followed by pooling and softmax layer.
+Task-specific models for BERT
+ +NER - Extract representation of the first-word piece of each token followed by the softmax layer
+SA, NLI, PD, STS - standard BERT training
+Main observations
+Feature extraction and fine-tuning have comparable performance in most cases, unless the two tasks are highly similar (fine-tuning is better) or highly dissimilar (feature extraction is better).
+For ELMo, feature extraction consistently outperforms fine-tuning for the sentence pair tasks (NLI, PD, STS). The reverse trend is observed for BERT with fine-tuning being better on sentence pair tasks.
+Adding extra parameters is helpful for feature extraction but not fine-tuning.
+ELMo fine-tuning requires careful tuning and other tricks like triangular learning rates, gradual unfreezing and discriminative fine-tuning.
+For the tasks considered, there is no correlation observed between the distance of the source and target domains and adaptation performance.
+Training a diagnostic classifier (on the intermediate representations) suggests that fine-tuning improves the performance of the classifier at all the intermediate layers (which is sort of expected).
+In terms of mutual information estimates, fine-tuned representations have a much higher mutual information as compared to the feature extraction based representations.
+Knowledge for single-sentence tasks seems to be mostly concentrated in the last layers, while for pair classification tasks, the knowledge seems to be gradually built up in the intermediate layers, all the way up to the last layer.
+Graph Neural Network (GNN) is a family of powerful machine learning (ML) models for graphs that can combine node information with the structural information.
+One downside of GNNs is that their predictions are hard to interpret.
+The paper proposes GNN Explainer model for solving the problem of interpretability.
+Local edge fidelity - identify the subgraph structure (ideally the smallest) that significantly affected the predictions of the GNN, ie identify the important edges in the graph (for a given prediction).
+Local node fidelity - identify the important node features and correlations in the features of the neighboring nodes.
+Single instance and multi-instance explanations - Support both single instance prediction tasks and multi-instance prediction tasks.
+Model Agnostic - Support a large family of models (ideally all)
+Task Agnostic - Support a large family of tasks (ideally all)
+I first describe the single instance prediction case and use that as the base to describe the multiple instance prediction cases. All the discussion in this section assumes a single instance prediction task.
+Input: Trained GNN, a single instance whose prediction is to be explained.
+Task: Identify the small subgraph and the small subset of features that explain the prediction.
+Idea: Maximize the mutual information (MI) between the GNN and the explanation by learning a graph mask which can be used for selecting the relevant subgraph (from the GNN’s computational graph) and features (from all layers of the GNN).
+The computational graph of a GNN (corresponding to a node) refers to the approximately L-hop neighborhood of the node in the graph, ie the subgraph formed by the nodes and edges whose representations affected the representation of the given node.
+For a node v, the information used to predict its label y is completely described by its computation graph Gc(v) and the associated feature set Xc(v). The feature set includes the features of all the nodes in the computation graph.
+When constructing the explanation, only Gc(v) and Xc(v) are used.
+The task can be reformulated as identifying a subgraph GS (subset of Gc(v)) with associated features XS which are important when predicting the label y for node v.
+“Importance” is measured in terms of MI
+MI(Y, (GS, XS)) = H(Y) - H(Y | G = GS, X = XS) where H is the entropy and Y is a random variable representing the prediction.
+ +A further constraint, |GS| < k, is imposed to obtain concise explanations.
+Since H(Y) is fixed (recall that the network has already been trained and is now being used in the inference mode), maximizing MI is equivalent to minimizing the conditional entropy H(Y | G = GS, X = XS)
+This is equivalent to selecting the subgraph that minimizes the uncertainty in the prediction of y when the computational graph is Gc(v)
+Given the exponentially large number of possible subgraphs, we can not directly optimize the given equation.
+A “relaxed”-adjacency matrix (whose values are real numbers in the range 0 to 1) is introduced where each element of this fractional adjacency matrix is smaller than the corresponding element of the original adjacency matrix. Gradient descent can be performed on this adjacency matrix.
+The “relaxed” GS can be interpreted as a variational approximation of the subgraph distribution of Gc(v), and the objective can be written as min E_GS[H(Y | G = GS, X = XS)].
+Now the paper makes a big approximation that the GNN is convex, so as to leverage Jensen’s inequality and push the expectation inside the entropy term to get an upper bound, which is then minimized: min H(Y | G = E[GS], X = XS).
+The paper reports that the convexity approximation (along with discreteness constraint) works in practice.
+Next, a mean-field approximation is used to decompose P(GS) as a multivariate Bernoulli distribution, ie the product of AS(i, j) for all (i, j) belonging to Gc(v). AS can be optimized directly, and its values represent the expectation of the Bernoulli distribution on whether the edge (i, j) exists.
+Given the constraints on AS, it is easier to learn a mask matrix M and optimize it such that AS = M * Ac, where * denotes element-wise multiplication. Additionally, the sigmoid operator can be applied to M.
+Once M is learned, only the top k values are retained.
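+A condensed sketch of this mask optimization (gnn, A_c, X_c, and y are placeholders for the trained model returning class log-probabilities for node v, the computational graph's adjacency, its features, and the predicted label; the regularization weight is illustrative):
+```python
+import torch
+
+M = torch.randn_like(A_c, requires_grad=True)     # "relaxed" edge mask
+opt = torch.optim.Adam([M], lr=0.01)
+for _ in range(200):
+    A_s = torch.sigmoid(M) * A_c                  # masked, fractional adjacency
+    loss = -gnn(A_s, X_c)[y]                      # minimize H(Y | G = GS) for label y
+    loss = loss + 0.005 * torch.sigmoid(M).sum()  # encourage small subgraphs
+    opt.zero_grad()
+    loss.backward()
+    opt.step()
+```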
+Similar to the previous approach, another feature mask is learned (either one for entire GNN or one per node of the GNN) and is used as a feature selector.
+The mask could either be learned such that same set of node features (in terms of dimensions) are selected or a different set of features are selected per node. The paper uses the former as it is more straightforward.
+Just like before, a “relaxed” mask MT is trained to select features as MT * XS.
+One tricky case is where a feature is important but its value is set to 0. In that case, the value will be masked even though it should not be.
+The workaround is to use Monte Carlo (MC) estimates of marginals of the missing features. This gives a way to assign importance scores to each feature dimension and a form of reparameterization trick is used to perform end-to-end learning.
+Masks are encouraged to be discrete by regularizing their element-wise entropy.
+The resulting computation graph is valid in the sense that it allows message passing towards the central node v.
+Given a set of nodes (having the label say y), the task is to obtain a global explanation of the predictions.
+For the given class, a prototypical reference node is chosen by computing the mean of embeddings of all the nodes in the class and then selecting the node which is closest to the mean.
+Now, compute the important computational graph corresponding to this node and align the computational subgraphs of all the other nodes (in the given class) to reference.
+Let A* be the adjacency matrix and X* the feature matrix for the explanation corresponding to the reference node. Let Av and Xv be the adjacency matrix and feature matrix of the to-be-aligned computational graph.
+A relaxed alignment matrix P is optimized to align the nodes and features in the two graphs, ie we minimize |PᵀAvP - A*| + |PᵀXv - X*|.
+Choosing concise explanations helps in efficient graph matching.
+For GNNs that compute attention over the entire graph, edges with low attention weight can be pruned to increase efficiency.
+Datasets
+ +Node classification: BA-Shapes, BA-Community, Tree-Cycles, Tree-Grid
+Graph classification: MUTAG, Reddit-Binary
+Baselines
+GRAD - compute the gradient of the model’s loss with respect to the adjacency matrix and the node features, and pick the edges with the highest absolute gradient.
+GAT - Graph Attention Network
+The proposed model seems to outperform the baselines both qualitatively and quantitatively. But the results should be taken with a grain of salt as only 2 baselines are considered.
+Standard unsupervised learning aims to learn transferable features. The paper proposes to learn a transferable learning rule (in an unsupervised manner) that can generalize across tasks and architectures.
+Consider training the model with supervised learning - φt+1 = SupervisedUpdate(φt, xt, yt, θ).
+Here t denotes the step, (x, y) denotes the data points, θ denotes the hyperparameters of the optimizer.
+Extending this formulation for meta-learning, one could say that t is the step of the inner loop, θ are the parameters of the meta learning model.
+Further, the paper proposes to use φt+1 = UnsupervisedUpdate(φt, xt, θ), ie yt is not used (nor even assumed to be available, as this is unsupervised learning).
+The meta update rule is used to learn the weights of a meta-model by performing SGD on the sum of MetaObjective over the distribution of tasks (over the course of inner loop training).
+Base model: MLP with parameters φt
+To ensure that it generalizes across architectures, the update rule is designed to be neuron-local, ie updates are a function of pre- and post-synaptic neurons, though, in practice, this constraint is relaxed to decorrelate neurons by using cross-neuron information.
+Each neuron i in every layer l (in the base model) has an update network (MLP) which takes as input the feedforward activations, feedback weights, and error signals, ie h_b^l(i) = MLP(x_b^l(i), z_b^l(i), v^(l+1), δ^l(i), θ).
+ +All the update networks share the meta parameters θ
+The model is run in a standard feed-forward manner, and the update network (corresponding to each unit) is used to generate the error signal δ_b^l(i) = lin(h_b^l(i)).
+This loss is backpropagated using the set of learned backward weights v^l instead of the forward weights w^l.
+The weight update Δwl is also generated using a per-neuron update network.
+The MetaObjective is based on fitting a linear regression model to labeled examples with a small number of data points.
+Given the emphasis on learning generalizable features, the weights (of linear regression) are estimated on one batch and evaluated on another batch.
+The MetaObjective is to reduce the cosine distance between yb and vᵀxbL (see the sketch after this list).
+ +yb - Actual labels on the evaluation batch
+xbL - Features of the evaluation batch (using the base model)
+v - parameters of the linear regression model (learned on train batch)
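+A rough sketch of this objective, assuming one-hot labels and a closed-form ridge-regression fit (the ridge term is my addition for numerical stability, not from the paper):
+```python
+import numpy as np
+
+def meta_objective(x_train, y_train, x_eval, y_eval, ridge=1e-3):
+    d = x_train.shape[1]
+    # fit the linear model v on the training batch (one-hot y_train)
+    v = np.linalg.solve(x_train.T @ x_train + ridge * np.eye(d),
+                        x_train.T @ y_train)
+    y_hat = x_eval @ v
+    # cosine distance between predictions and labels on the evaluation batch
+    cos = (y_hat * y_eval).sum() / (np.linalg.norm(y_hat) * np.linalg.norm(y_eval))
+    return 1.0 - cos
+```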
+Meta gradients are approximated using truncated backpropagation through time.
+Increasing variation in the training dataset helps the meta optimization process. Data is augmented with shifts, rotations, and noise. Predicting these coefficients is an auxiliary (regression) task for training the meta-objective.
+Training the system requires a lot of resources - 8 days with 512 workers.
+With standard unsupervised learning, the performance (on transfer task) starts declining after some time even though the performance (on the unsupervised task) is improving. This suggests that the objective function for the two tasks starts to mismatch.
+UnsupervisedUpdate leads to a better generalization as compared to both VAE and supervised learning (followed by transfer).
+UnsupervisedUpdate also leads to a positive transfer across domains (vision to language) when trained for a shorter duration of time (to ensure that the meta-objective does not overfit).
+UnsupervisedUpdate also generalizes to larger model architectures and different activation functions.
+Continual Learning paradigm focuses on learning from a non-stationary stream of data with additional desiderata - transferring knowledge from previously seen task to unseen tasks and being resilient to catastrophic forgetting - all with a fixed memory and computational budget.
+This is in contrast to the IID (independent and identically distributed) assumption in statistical learning.
+One common example of the non-iid data is setups involving sequential decision making - eg Reinforcement learning.
+Many existing benchmarks use MNIST as the underlying dataset (eg Permuted MNIST, Split MNIST, etc). These benchmarks lack complexity and make it hard to observe positive and negative backward transfer.
+Most works focus only on the catastrophic forgetting challenge and ignore the other issues (like computation and memory footprint, the capacity of the network, etc).
+The paper proposes a new benchmark based on Starcraft II video game to understand the different approaches for lifelong learning.
+The sequence of tasks is designed to be a curriculum - the learning agent starts with learning simple skills and later moves to more complex tasks. These complex tasks require remembering and composing skills learned in the earlier levels.
+To evaluate for catastrophic forgetting, the tasks are designed such that not all the skills are needed for solving each task. Hence the learning agent needs to remember skills even though they are not needed at the current level.
+Each level comes with a fixed computational budget of episodes and each episode has a fixed time limit. Once the budget is consumed the agent has to proceed to the next level. Hence agents with better sample efficiency would benefit.
+The benchmark supports both RL and supervised learning versions. In the supervised version, expert agents (pretrained on each level) are also provided.
+Baselines are provided for distillation (using experts): sequential training (fine tuning), Dropout and SER. None of the baseline methods achieve positive or negative backward transfer.
+When modeled as a pure RL task, the benchmark is extremely difficult to solve.
+The paper suggests using a metric to record the amount of learning/data required to recover performance on the previous task.
+The paper presents some general ideas and mechanisms for multiple model-based RL. Even though the task and model architecture may not be very relevant now, I find the general idea and the mechanisms to be quite useful. As such, I am focusing only on high-level ideas and not the implementation details themselves.
+The main idea behind Multiple Model-based RL (MMRL) is to decompose complex tasks into multiple domains in space and time so that the environment dynamics within each domain is predictable.
+MMRL proposes an RL architecture composed of multiple modules, each with its own state prediction model and RL controller.
+The prediction error from each of the state prediction models defines the “responsibility signal” for each module (see the sketch after the list below).
+This responsibility signal is used to:
+ +Weigh the state prediction output ie the predicted state is the weighted sum of individual state predictions (weighted by the responsibility signal).
+Weigh the parameter update of the environment models as well as the RL controllers.
+Weigh the action output - ie the predicted action is a weighted sum of the individual actions.
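+As referenced above, a minimal sketch of a soft responsibility signal (a softmax over negative prediction errors; the scale sigma is illustrative):
+```python
+import numpy as np
+
+def responsibilities(prediction_errors, sigma=1.0):
+    scores = np.exp(-np.asarray(prediction_errors) / (2 * sigma ** 2))
+    return scores / scores.sum()  # one weight per module, summing to 1
+```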
+The framework is amenable for incorporating prior knowledge about which module should be selected.
+In the modular decomposition of a task, the modules should not change too frequently and some kind of spatial and temporal continuity is also desired.
+Temporal continuity can be accounted for by using the previous responsibility signal as input during the current timestep.
+Spatial continuity can be ensured by considering a spatial prior like the Gaussian spatial prior.
+Though model-free methods could be used for learning the RL controllers, model-based methods could be more relevant given that the modules are learning state-prediction models as well.
+Exploration can be ensured by using a stochastic version of greedy action selection.
+One failure mode for such modular architectures is when a single module tries to perform well across all the tasks. The modules themselves should be relatively simplistic (eg linear models) which can learn quickly and generalize well.
+A non-stationary hunting task in a grid world and a non-linear, non-stationary control task of swinging up a pendulum provide the proof of concept for the proposed methods.
+The paper introduces a simple data augmentation protocol that provides a good compositional inductive bias for sequential models.
+Synthetic examples are created by taking real sequences and replacing fragments with other fragments that appear in similar environments. This operation is referred to as GECA (Good Enough Compositional Augmentation).
+The underlying idea is that if two fragments of training examples occur in some environment, then any environment where the first fragment appears is also a valid environment for the second fragment.
+Discover substitutable fragments (ie pairs of fragments that co-occur with a common fragment) and use them to generate new sequences by swapping fragments.
+The current work uses very simple criteria to decide if fragments are substitutable - fragments should occur in at least one lexical environment that is exactly the same. A lexical environment is the k-word window around each span of the fragment.
+Though the idea can be motivated by work in generative syntax and distributional semantics, it would not hold like a physical law when applied to the real data.
+The authors view this tradeoff as a balance between the shortage of training data vs relative frequency of mistake in the proposed data augmentation approach.
+The approach is evaluated on the SCAN dataset when the model is trained on the short sequence of English commands. Though the dataset augmentation helps the baseline models, it is not surprising given the nature of the SCAN dataset.
+More challenging tasks (for evaluating the proposed approach) are semantic parsing (where the query is represented in the form of λ-calculus or SQL) and low-resource language modeling. While the improvement (in terms of metrics) is sometimes limited, the gains are consistent across different datasets.
+Given that the proposed approach is relatively simple and straightforward, it appears to be quite promising.
+Relational Reinforcement Learning (RRL) paradigm uses relational state (and action) space and policy representation to leverage the generalization capability of relational learning for reinforcement learning.
+The paper shows the effectiveness of RRL - in terms of generalization, sample efficiency, and interpretability - using box-world and StarCraft II minigames.
+The main idea is to use neural network models that operate on structured representations and perform relational reasoning via iterated, message-passing style methods.
+Use of non-local computations using a shared function (in terms of pairwise interactions between entities) provides a better inductive bias.
+Multi-head dot product attention mechanism is used to model the pairwise interactions (with one or more attention blocks).
+Iterative computations can be used to capture higher-order interactions between entities.
+Entity extraction is based on the assumption that entities are things located at a particular point in space.
+A CNN is used to parse the pixel space observation into k feature maps of size nxn. The (x, y) coordinates are concatenated to each k-dimensional pixel feature-vector to indicate the pixel’s position in the map.
+The resulting n² x k matrix acts as the entity matrix.
+Actor-critic architecture (using distributed agent IMPALA) is used.
+12 x 12-pixel room with keys and boxes placed randomly.
+Agent can move in 4 directions.
+The task is to collect gems by unlocking boxes (which may contain keys to unlock other boxes).
+Each level has a unique sequence in which boxes need to be opened as opening the wrong box could make the level unsolvable.
+The difficulty of a level can be controlled using: (i) the number of boxes in the path to the goal, (ii) the number of distractor branches, and (iii) the length of the distractor branches.
+RRL agents solve over 98% of the levels while the RL agent solves less than 95% of the levels.
+Visualising the attention scores indicate that:
+ +keys attend to locks they can unlock.
+all objects attend to agent’s location.
+agent and gem attend to each other (and themselves).
+Generalization capacity is tested in two ways:
+ +Performance on levels that require opening a larger sequence of boxes than it is trained on.
+Performance on levels that require key-lock combinations not seen during training.
+In both the scenarios, the RRL agent significantly outperforms the RL agent.
+The RRL agent achieves better or equal results than the RL agent in all but one game.
+For testing generalization, the agent that was trained for controlling two marines was transferred to a task which requires it to control 5 marines. These results are not conclusive given the high variability.
+The paper looks at the problem of learning structured exploration policies for training RL agents.
+Consider a stochastic, parameterized policy πθ(a|s) where θ represents the policy-parameters.
+To encourage exploration, noise can be added to the policy at each time step t. But the noise added in such a manner does not have any notion of temporal coherence.
+Another issue is that if the policy is represented by a simple distribution (say parameterized unimodal Gaussian), it can not model complex time-correlated stochastic processes.
+The paper proposes to condition the policy on per-episode random variables (z) which are sampled from a learned latent distribution.
+Consider a distribution over the tasks p(T). At the start of any episode of the ith task, a latent variable zi is sampled from the distribution N(μi, σi), where μi and σi are the learned parameters of the distribution and are referred to as the variational parameters.
+Once sampled, the same zi is used to condition the policy for as long as the current episode lasts, and the action is sampled from the distribution πθ(a|s, zi).
+The intuition is that the latent variable zi would encode the notion of a task or goal that does not change arbitrarily during the episode.
+The paper focuses on the setting where the structured exploration policies are to be learned while leveraging the learning from prior tasks.
+A meta-learning approach, called model agnostic exploration with structured noise (MAESN), is proposed to learn a good initialization of the policy-parameters and to learn a latent space (for sampling z from) that can inject structured stochasticity into the policy.
+General meta-RL approaches have two limitations when it comes to “learning to explore”:
+ +Idea behind MAESN is to meta-train policy-parameters so that they learn to use the task-specific latent variables for exploration and can quickly adapt to a new task.
+An important detail is that the parameters are optimized to maximize the expected rewards after one step of gradient update to ensure that the policy uses the latent variables for exploration.
+For every iteration of meta-training, an “inner” gradient update is performed on the variational parameters and the post-inner-update parameters are used to perform the meta-update.
+The authors report that performing the “inner” gradient update on the policy-parameters does not help the overall learning objective and that the step size for each parameter had to be meta-learned.
+The variational parameters have the usual KL divergence loss, which encourages them to be close to the prior distribution (a unit Gaussian in this case).
+After training, the variational parameters for each task are quite close to the prior probably because the training objective optimizes for the expected reward after one step of gradient descent on the variational parameters.
+Another implementation detail is that reward shaping is used to ensure that the policy gets useful signal during meta-training. To be fair to the baselines, reward shaping is used while training baselines as well. Moreover, the policies trained with reward shaping generalizes to sparse reward setup as well (during meta-test time).
+Three tasks distributions: Robotic Manipulation, Wheeled Locomotion, and Legged Locomotion. Each task distribution has 100 meta-training tasks.
+In the Manipulation task distribution, the learner has to push different blocks from different positions to different goal positions. In the Locomotion task distributions, the different tasks correspond to the different goal positions.
+The experiments show that the proposed approach can adapt to new tasks quickly and learns a coherent exploration strategy.
+In some cases, learning from scratch also provides strong asymptotic performance, although learning from scratch takes much longer.
diff --git a/_site/site/2019/06/13/Extrapolating-Beyond-Suboptimal-Demonstrations-via-Inverse-Reinforcement-Learning-from-Observations.html b/_site/site/2019/06/13/Extrapolating-Beyond-Suboptimal-Demonstrations-via-Inverse-Reinforcement-Learning-from-Observations.html new file mode 100644 index 00000000..a2a21f00 --- /dev/null +++ b/_site/site/2019/06/13/Extrapolating-Beyond-Suboptimal-Demonstrations-via-Inverse-Reinforcement-Learning-from-Observations.html @@ -0,0 +1,72 @@ +The paper proposes a new inverse RL (IRL) algorithm, called Trajectory-ranked Reward EXtrapolation (T-REX), that learns a reward function from a collection of ranked trajectories.
+Standard IRL approaches aim to learn a reward function that “justifies” the demonstration policy and hence those approaches cannot outperform the demonstration policy.
+In contrast, T-REX aims to learn a reward function that “explains” the ranking over demonstrations and can learn a policy that outperforms the demonstration policy.
+The input is a sequence of trajectories T1, … Tm which are ranked in the order of preference. That is, given any pair of trajectories, we know which of the two trajectories is better.
+The setup is to learn from observations where the learning agent does not have access to the true reward function or the action taken by the demonstration policy.
+Reward Inference
+ +A parameterized reward function rθ is trained with the ranking information using a binary classification loss function which aims to predict which of the two given trajectory would be ranked higher.
+Given a trajectory, the reward function predicts the reward for each state. The sum of rewards (corresponding to the two trajectories) is used used to predict the preferred trajectory.
+T-REX uses partial trajectories instead of full trajectories as a data augmentation strategy.
+Policy Optimization
+ +Environments: Mujoco (Half Cheetah, Ant, Hooper), Atari
+Demonstrations generated using PPO (checkpointed at different stages of training).
+Ensemble of networks used to learn the reward functions.
+The proposed approach outperforms the baselines Behaviour Cloning from Observations and Generative Adversarial Imitation Learning.
+In terms of reward extrapolation, T-REX can predict the reward for trajectories which are better than the demonstration trajectories.
+Some ablation studies considered the effect of adding noise (random swapping the preference between trajectories) and found that the model is somewhat robust to noise up to an extent.
+The paper proposes a very cool idea at the intersection of deep learning and physics.
+The idea is to train a neural network architecture that builds on the concept of Hamiltonian Mechanics (from Physics) to learn physical conservation laws in an unsupervised manner.
+It is a branch of physics that can describe systems which follow some conservation laws and invariants.
+Consider a set of N pair of coordinates [(q1, p1), …, (qN, pN)] where q = [q1, …, qN] dnotes the position of the set of objects while p = [p1, …, pN] denotes the momentum of the set of variables.
+Together these N pairs completely describe the system.
+A scalar function H(q, p), called as the Hamiltonian is defined such that the partial derivative of H with respect to p is equal to derivative of q with respect to time t and the negative of partial derivative of H with respect to q is equal to derivative of p with respect to time t.
+This can be expressed in the form of the equation as follows:
+The Hamiltonian H can be parameterized using a neural network and can learn conserved quantities from the data in an unsupervised manner.
+The loss function looks as follows:
+For setups where the energy must be conserved exactly, (eg ideal mass-spring and ideal pendulum), the HNN learn to preserve an energy-like scalar.
+For setups where the energy need not be conserved exactly, the HNNs still learn to preserve the energy thus highlighting a limitation of HNNs.
+In case of two body problems, the HNN model is shown to be much more robust when making predictions over longer time horizons as compared to the baselines.
+In the final experiment, the model is trained on pixel observations and not state observations. In this case, two auxiliary losses are added: auto-encoder reconstruction loss and a loss on the latent space representations. Similar to the previous experiments, the HNN model makes robust predictions over much longer time horizons.
+The paper proposes a dataset to diagnose the abstract reasoning capabilities of learning systems.
+The paper shows that a variant of the relational networks, explicitly designed for abstract reasoning, outperforms models like ResNets.
+Visual reasoning tasks, that are inspired by the human IQ test, are used to evaluate the models in terms of generalization.
+Let’s say that we want to test if the model understands the abstract notion of “increasing”. We could train the model on data that captures the notion of “increasing”, in terms of say increasing size (or quantities) of objects and then test it on a dataset where the notion is expressed in terms of increasing intensity of color.
+The dataset is then used to evaluate if the models can find any solution to such abstract reasoning tasks and how well they generalize when the abstract content is specifically controlled.
+Each task consists of an incomplete 3x3 matrix of images (in the style of Raven’s Progressive Matrices, or RPMs) where the missing image needs to be filled in, typically by choosing from a set of candidate images.
+As such, it is possible to justify multiple answers to be correct though, in practice, the right answer is the one with the simplest explanation.
+RPM-like matrices are generated procedurally by building an abstract structure for the matrices.
+The abstract structure S consists of 3 components: (i) Relation types (R), (ii) Object types (O) and (iii) Attribute types (A), ie *S = {(r, o, a) | r in R, o in O and a in A}*.
+This can be read as: “Structure S is instantiated on attribute a of object o and exhibits the relation r”. For example, S is instantiated on “color” of object “shape” and exhibits the relation “increasing”.
+In general, the structure could be made of more than one such tuple; the more tuples it contains, the harder the task.
+Given the structure, sample values v for each attribute a while conforming with the relation r. For example, if the attribute is “color” and the relation is “increasing”, the intensity of color must increase.
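+A small sketch of sampling such an abstract structure (the vocabularies below follow the relation/object/attribute sets described in the paper, but treat them as illustrative):
+```python
+import random
+
+RELATIONS = ["progression", "XOR", "OR", "AND", "consistent union"]
+OBJECTS = ["shape", "line"]
+ATTRIBUTES = ["size", "type", "colour", "position", "number"]
+
+def sample_structure(num_triples=1):
+    # Each triple (r, o, a) instantiates relation r on attribute a of object o;
+    # structures with more triples yield harder tasks
+    return [(random.choice(RELATIONS), random.choice(OBJECTS), random.choice(ATTRIBUTES))
+            for _ in range(num_triples)]
+```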
+The paper tests for the following generalization scenarios:
+Neutral: The structure of the training and test data can contain any tuple.
+Interpolation: The training data contains even-indexed members of the attribute values while the test data contains odd-indexed members of the attribute values.
+Extrapolation: The training data contains first-half of the attribute values while the test data contains the second-half of the attribute values.
+Heldout attribute: Training data contains no tuples with (o = shape, a = color) or (o = line, a = type).
+Heldout triples: Out of 29 possible triples, 7 are held out from training and only used during testing.
+Heldout pair-of-triples: Out of 400 possible sets of pair of triples, 40 were held out and used only during testing.
+Heldout attribute pair: Out of 20 (unordered) variable attribute pairs, 4 were held out and used only during testing.
+Input: 8 context panels (from the 3x3 matrix) where the last panel needs to be filled.
+CNN-MLP - 4 layer CNN with batchnorm and ReLU.
+ResNet - ResNet-50 (as it performed better than ResNet-101 and ResNet-152).
+LSTM
+Wild Relation Network (WReN) - A CNN model encodes the 8 panels and the candidate answers and feeds them as input to a relational network.
+Context-blind ResNet - ResNet network without the context (or the 8 input panels).
+WReN model outperforms the other models on the Neutral setup.
+Models have a harder time reasoning about differences in size than about differences in quantity.
+WReN is the best performing model in all the setups and the rest of the discussion applies only to that model.
+Generalisation is easiest in the case of interpolation and worst in the case of extrapolation, hinting at the limited generalization capability of the models.
+The model is also trained to predict the relevant relation, object and attribute types using the meta-targets that encode this information.
+The auxiliary training helps in all the cases. Further, the model’s accuracy on the main task is higher in the cases where it solves the auxiliary tasks well.
+For abstract visual reasoning tasks, the choice of model can make a large difference, as illustrated by the gap between ResNets and Relational Networks.
+Using auxiliary loss that encourages the model to “explain” its reasoning (in this case by predicting the attributes, relations, etc) helps to improve the performance on the main task as well.
+Given that the challenge is motivated by tasks used to measure human IQ, it would have been interesting to get an estimate of human performance on at least a subset of this dataset.
+Consider problems where the input to the model is a set. In such problems (referred to as the set-input problems), the model should be invariant to the permutation of the data points.
+In “set pooling” methods (1, 2), each data point (in the input set) is encoded using a feed-forward network and the resulting set of encoded representations are pooled using the “sum” operator.
+This approach can be shown to be both permutation-invariant and a universal function approximator.
+The paper proposes an attention-based network module, called the Set Transformer, which can model the interactions between the elements of an input set while being permutation invariant.
+An attention function \(Attn(Q, K, V) = (QK^{T})V\) is used to map queries Q to outputs using key-value pairs K, V.
+In case of multi-head attention, the key, query, and value are projected into h different vectors and attention is applied on all these vectors. The output is a linear transformation of the concatenation of all the vectors.
+3 modules are introduced: MAB, SAB and ISAB.
+Multihead Attention Block (MAB) is a module very similar to the encoder in the Transformer, without the positional encoding and dropout.
+Set Attention Block (SAB) is a module that takes as input a set and performs self-attention between the elements of the set to produce another set of the same size ie SAB(X) = MAB(X, X).
+The time complexity of the SAB operation is \(O(n^{2})\) where n is the number of elements in the set. It can be reduced to \(O(mn)\) by using Induced Set Attention Blocks (ISAB) with m induced point vectors (denoted as I).
+\(ISAB_{m}(X) = MAB(X, MAB(I, X))\).
+ISAB can be seen as performing a low-rank projection of inputs.
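+A compact PyTorch-style sketch of these blocks (layer sizes and the use of nn.MultiheadAttention are assumptions; the paper’s implementation differs in details):
+```python
+import torch
+import torch.nn as nn
+
+class MAB(nn.Module):
+    # Multihead Attention Block: attend from X to Y, then apply a feedforward layer
+    def __init__(self, dim, num_heads=4):
+        super().__init__()
+        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
+        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
+        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
+
+    def forward(self, X, Y):
+        H = self.ln1(X + self.attn(X, Y, Y)[0])
+        return self.ln2(H + self.ff(H))
+
+class SAB(nn.Module):
+    # Set Attention Block: SAB(X) = MAB(X, X), O(n^2) in the set size
+    def __init__(self, dim):
+        super().__init__()
+        self.mab = MAB(dim)
+
+    def forward(self, X):
+        return self.mab(X, X)
+
+class ISAB(nn.Module):
+    # Induced Set Attention Block: ISAB_m(X) = MAB(X, MAB(I, X)), O(mn)
+    def __init__(self, dim, m):
+        super().__init__()
+        self.I = nn.Parameter(torch.randn(1, m, dim))
+        self.mab1, self.mab2 = MAB(dim), MAB(dim)
+
+    def forward(self, X):
+        H = self.mab1(self.I.expand(X.size(0), -1, -1), X)  # (batch, m, dim)
+        return self.mab2(X, H)                              # (batch, n, dim)
+```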
+These modules can be used to model the interactions between data points in any given set.
+Aggregation is performed by Pooling by Multihead Attention (PMA): multi-head attention is applied over a set of k learnable seed vectors.
+The interaction between the k outputs (from PMA) can be modeled by applying another SAB.
+Thus the entire network is a stack of SABs and ISABs. Both the modules are permutation invariant and so is any network obtained by stacking them.
+Datasets include:
+ +Generally, increasing m (the number of inducing datapoints) improves performance, up to some extent. This is somewhat expected.
+The paper considers various ablations of the proposed approach (like disabling attention in the encoder or pooling layer) and shows that the attention mechanism is needed during both stages.
+The work has two main benefits over prior work:
+ +Reducing the \(O(n^{2})\) complexity to \(O(mn)\).
+Using the self-attention mechanism both for encoding the inputs and for aggregating the encoded representations.
+The paper introduces a new, procedurally generated environment called CoinRun that is designed to benchmark the generalization capabilities of RL algorithms.
+The paper reports that deep convolutional architectures and techniques like L2 regularization, batch norm, etc (which were proposed in the context of generalization in supervised learning) are also useful for RL.
+CoinRun is made of multiple levels.
+In each level, the agent spawns on the far left side and needs to collect a single coin that lies on the far right side.
+There are many obstacles in between and colliding with an obstacle leads to the agent’s death.
+Each episode extends for a maximum of 1000 steps.
+CoinRun is designed such that given sufficient training time and levels, a near-optimal policy can be learned for all the levels.
+Generalization can be measured by training an agent on a given set of training tasks and evaluating it on an unseen set of test tasks.
+9 agents are trained to play CoinRun, on different training sets (each with a different number of levels).
+The first 8 agents are trained on sets of size 100 to 16000 levels while the last agent is trained on an unbounded set of levels.
+Training a model on an unbounded set of levels provides a good proxy for the train-to-test generalization performance.
+Two convolutional architectures (of different sizes) are compared:
+ +Nature-CNN: The CNN architecture used in the Deep Q Network. This is the smaller network among the two models.
+IMPALA-CNN: The CNN architecture used in the IMPALA architecture.
+The IMPALA-CNN agent always outperforms the Nature-CNN agent, indicating that the larger architecture generalizes better. But increasing the network size beyond a limit gives diminishing returns.
+While both L2 regularization and Dropout help to improve generalization, L2 regularization is more impactful.
+A domain randomization/data augmentation approach is tested where rectangular regions of different sizes are masked and assigned a random color. This approach seems to improve performance.
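+A sketch of this kind of masking augmentation (the number and size of boxes are assumptions):
+```python
+import numpy as np
+
+def random_cutout_color(obs, max_frac=0.5, max_boxes=3):
+    # Mask a few random rectangles of the observation with random colors
+    img = obs.copy()
+    h, w, _ = img.shape
+    for _ in range(np.random.randint(1, max_boxes + 1)):
+        bh = np.random.randint(1, int(h * max_frac))
+        bw = np.random.randint(1, int(w * max_frac))
+        y, x = np.random.randint(0, h - bh), np.random.randint(0, w - bw)
+        img[y:y + bh, x:x + bw] = np.random.randint(0, 256, size=3)
+    return img
+```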
+Batch Normalization helps to improve performance as well.
+Environment stochasticity is introduced by using sticky actions while policy stochasticity is introduced by controlling the entropy bonus. Both these forms of stochasticity boost performance.
+While combining different regularization methods helps, the gains are only marginally better than using just one regularization approach. This suggests that these different approaches induce similar generalization properties.
+Two additional environments are also considered to verify the high degree of overfitting observed in the CoinRun environment:
+ +CoinRun-Platforms:
+ +Unlike CoinRun, each episode can have multiple coins and the time limit is increased to 1000 steps.
+Levels are larger as well, so the agent might need to backtrack.
+RandomMazes:
+ +Partially observed environment with square mazes of dimensions 3x3 to 25x25.
+Time limit of 500 steps.
+Overfitting is observed for both these environments as well.
+The paper presents a benchmark and experimental protocol (environments, metrics, baselines, training/testing setup) to evaluate RL algorithms for generalization.
+Several RL algorithms are evaluated and the key takeaway is that the “vanilla” RL algorithms can generalize better than the RL algorithms that are specifically designed to generalize, given enough diversity in the distribution of the training environments.
+The focus is on evaluating generalization to environmental changes that affect the system dynamics (and not the goal or rewards).
+Two generalization regimes are considered:
+ +Interpolation - parameters of the test environment are similar to the parameters of the training environment.
+Extrapolation - parameters of the test environment are different from the parameters of the training environment.
+Following algorithms are considered as part of the benchmark:
+ +“Vanilla” RL algorithms - A2C, PPO
+RL algorithms that are designed to generalize:
+ +EPOpt - Learn a (robust) policy that maximizes the expected reward over the most difficult distribution of environments (ones with the worst expected reward).
+RL2 - Learn an (adaptive) policy that can adapt to the current environment/task by considering the trajectory and not just the state transition sequence.
+These specially designed RL algorithms can be optimized using either A2C or PPO leading to combinations like EPOpt-A2C or EPOpt-PPO etc.
+The models are composed either entirely of feedforward networks or of feedforward + recurrent networks.
+Environments
+ +CartPole, MountainCar, Acrobot, and Pendulum from OpenAI Gym.
+HalfCheetah and Hopper from OpenAI Roboschool.
+Three versions of each environment are considered:
+ +Deterministic: Environment parameters are fixed. This case corresponds to the standard environment setup in classical RL.
+Random: Environment parameters are sampled randomly. This case corresponds to sampling from a distribution of environments.
+Extreme: Environment parameters are sampled from their extreme values. This case corresponds to the edge-case environments which would not be encountered during training generally.
+Performance Metrics
+ +Average total reward per episode.
+Success percentage: Percentage of episodes where a certain goal (or reward) is obtained.
+Evaluation Metrics/Setups
+ +Default: success percentage when training and evaluating on the deterministic version of the environment.
+Interpolation: success percentage when training and evaluating on the random version of the environment.
+Extrapolation: the geometric mean of the success percentages of the following three setups (a small sketch follows this list):
+ +Train on deterministic and evaluate on the random version.
+Train on deterministic and evaluate on extreme version.
+Train on random and evaluate on the extreme version.
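+As referenced above, a minimal sketch of the extrapolation score (argument names are illustrative):
+```python
+import numpy as np
+
+def extrapolation_score(det_to_random, det_to_extreme, random_to_extreme):
+    # Geometric mean of the three train/evaluate success percentages listed above
+    return float(np.cbrt(det_to_random * det_to_extreme * random_to_extreme))
+```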
+Observations
+ +Extrapolation is harder than interpolation.
+Increasing the diversity in the training environments improves the interpolation generalization of vanilla RL methods.
+EPOpt improves generalization only for continuous control environments and only with PPO.
+RL2 is difficult to train on the environments considered and did not provide a clear advantage in terms of generalization.
+EPOpt-PPO outperforms PPO on only 3 environments, and EPOpt-A2C does not outperform A2C.
+The paper proposes a new algorithm called Probabilistic Ensembles with Trajectory Sampling (PETS) that combines uncertainty-aware deep learning models (ensembles of deep learning models that encode uncertainty) with sampling-based uncertainty propagation.
+PETS improves over other probabilistic MBRL approaches by isolating epistemic uncertainty (due to limited training data) and aleatoric uncertainty (inherent in the system).
+Aleatoric uncertainty can be accounted for by learning a parameterized distribution (probabilistic neural network) trained with negative log-likelihood.
+Epistemic uncertainty can be accounted for by either having an infinite amount of data or by using ensembles.
+The paper uses a neural network to predict the mean and standard deviation of a Gaussian distribution which defines the predictive model. This setup is referred to as the “probabilistic” model and denoted by P.
+The alternate setup of the deterministic model is where a neural network is used to make a point prediction (and is denoted by D).
+Ensemble of probabilistic models is denoted as PE while that of deterministic models is denoted as DE.
+Model Predictive Control (MPC) is used for planning.
+Given a start state and an action sequence, the probabilistic dynamics model induces a distribution over the trajectories.
+The first action, among the sequence of optimized actions, is executed.
+Instead of random shooting, Cross Entropy Method (CEM) is used.
+Let us say there are B bootstrap models in the ensemble. Given the current state, P particles are created and each particle is propagated using one of the bootstrap models. Two variants are considered:
+ +TS1 - At each timestep, each particle samples a bootstrap. In this case, particle separation cannot be attributed to the compounding effects of the bootstraps.
+TS$\infty$ - The bootstrap model (per particle) is sampled just once and is not changed after that (sketched below). This setup separates aleatoric and epistemic uncertainty: aleatoric state variance is the average variance of particles of the same bootstrap, while epistemic state variance is the variance of the average of particles with the same bootstrap index.
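+A minimal sketch of TS$\infty$ particle propagation (the sample_next interface on each bootstrap model, which samples from its predicted Gaussian, is an assumption):
+```python
+import numpy as np
+
+def ts_inf_propagate(models, s0, actions, num_particles):
+    # TS-infinity: each particle keeps the same bootstrap model for the whole rollout
+    assignment = np.random.randint(len(models), size=num_particles)
+    particles = [np.array(s0, copy=True) for _ in range(num_particles)]
+    trajectory = [particles]
+    for a in actions:
+        particles = [models[assignment[i]].sample_next(particles[i], a)
+                     for i in range(num_particles)]
+        trajectory.append(particles)
+    return trajectory
+```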
+The proposed approach reaches the asymptotic performance of state-of-the-art model-free algorithms in much fewer samples.
+The general performance trend is probabilistic ensemble > probabilistic model > deterministic ensemble > deterministic model.
+Initial experiments for learning a policy by propagating gradients through the ensemble of models did not work, and this has been left as future work.
+The paper presents the task of abductive NLI (denoted alpha-NLI) where the model needs to perform abductive reasoning.
+Abductive reasoning is the inference to the most plausible explanation. Even though it is considered to be an important component for understanding narratives, the work in this domain is sparse.
+A new dataset called Abductive Reasoning in narrative Text (ART), consisting of 20K narrative contexts and 200K explanations, is also provided. The dataset models the task as multiple-choice questions to make the evaluation process easy.
+Given a pair of observations O1 and O2 and two hypotheses h1 and h2, the task is to select the most plausible hypothesis.
+In general, \(P(h \mid O_1, O_2)\) is proportional to \(P(h \mid O_1)P(O_2 \mid h, O_1)\).
+Different independence assumptions can be imposed on the structure of the problem: eg, one assumption could be that the hypothesis is independent of the observations, while the “fully connected” assumption jointly models both the observations and the hypothesis.
+Along with crowdsourcing several plausible hypotheses for each observation instance pair, an adversarial filtering algorithm (AF) is used to remove weak pairs of hypotheses.
+Observation pairs are created using the ROCStories dataset which is a collection of short, manually crafted stories of 5 sentences.
+The average length of both the context and the hypothesis is between 8 and 9 words.
+To collect plausible hypotheses, the crowd workers were asked to fill in a plausible “in-between” sentence in natural language.
+Given the plausible hypothesis, the crowd workers were asked to create an implausible hypothesis by editing fewer than 6 words.
+Adversarial filtering approach from Zellers et al. is used with BERT as the adversary. A temperature parameter is introduced to control the maximum number of instances that can be changed in each adversarial filtering iteration.
+Human performance: 91.4%
+Baselines like an SVM classifier, a bag-of-words classifier (using GloVe) and max-pooling over BiLSTM representations: approx 50%.
+Entailment NLI baseline: 59%. This highlights the additional complexity of abductive NLI as compared to entailment NLI.
+BERT: 68.9%
+GPT: 63.1%
+Numerical and spatial knowledge-based data points are particularly hard.
+The model is more likely to fail when the narrative created by the incorrect hypothesis is plausible.
+The memory layer is composed of 3 components:
+ +Query Network
+ +Key selection module
+ +Value lookup table
+ +All the parameters are trainable, though, in practice, only the selected k memory slots are updated.
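+A simplified sketch of these three components (this ignores the product-key factorization that makes key selection efficient at scale; sizes are illustrative):
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+class KeyValueMemory(nn.Module):
+    def __init__(self, dim, num_slots, k=32):
+        super().__init__()
+        self.query_net = nn.Linear(dim, dim)                    # query network
+        self.keys = nn.Parameter(torch.randn(num_slots, dim))   # key selection
+        self.values = nn.Embedding(num_slots, dim)              # value lookup table
+        self.k = k
+
+    def forward(self, x):
+        q = self.query_net(x)               # (batch, dim)
+        scores = q @ self.keys.t()          # (batch, num_slots)
+        top_scores, top_idx = scores.topk(self.k, dim=-1)
+        w = F.softmax(top_scores, dim=-1)
+        # Only the k selected value slots receive gradients
+        return (w.unsqueeze(-1) * self.values(top_idx)).sum(dim=1)
+```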
+Using a multihead attention mechanism helps to improve the performance further.
+One or more feedforward layers in the transformer are replaced by the memory layers.
+The model is evaluated on large scale language modeling tasks with 140 GB of data from common crawl corpora (28 billion words).
+Evaluation metrics
+ +Perplexity on the test set.
+Fraction of accessed values.
+KL divergence between the (normalized) weights of key access and uniform distribution.
+The last two metrics are used together to determine how well the keys are utilized.
+Given the large size of the training dataset, adding more layers to the transformer model helps.
+The effect of adding a memory layer is stronger than the effect of adding new layers to the transformer. For example, a 12 layer transformer + memory layer outperforms a 24 layer transformer while being almost twice as fast.
+The best position to place the memory is at an intermediate layer and placing the memory layer right after the input or just before the softmax layer does not work well in practice.
+The paper proposes the PHYRE (PHYsical REasoning) benchmark - consisting of classic mechanical puzzles in 2D physical environments - as a means to evaluate the physical reasoning ability of machine learning models.
+2D world that obeys Newtonian mechanics.
+Gravitational force + Friction.
+Non-deformable objects that can be static (ie fixed) or dynamic (ie can move and are affected by collisions etc).
+The learning agent starts in some initial world state (ie configuration of objects).
+Goal is described in the form of (subject, relation, object) where the agent’s task is to satisfy the relation between the subject and the object.
+Currently, only the “touch” relation is supported.
+The learning agent has to take a single action - placing one or more new dynamic objects in the world.
+A simulator is run on the new configuration (for a fixed amount of time) to check if the goal condition is satisfied.
+At the end of the simulation, a binary reward and intermediate observations (collected as the simulator executes) are provided to the learning agent.
+These observations are 256x256 grids where each grid cell can take 1 of 7 values (denoting different types of objects).
+Since only one relation is supported currently, the color is sufficient to encode the goal.
+Two benchmark tiers are provided where each tier comprises a combination of:
+ +a predefined set of all the actions that the agent is allowed to perform.
+set of tasks that can be solved by at least one action from the allowed action set.
+PHYRE-B - The agent is allowed to place a single ball (of any radius) at any valid location.
+PHYRE-2B - The agent is allowed to place 2 balls at any valid pair of locations.
+Each of the two tiers has 25 task templates where each template comprises variants of a single task (same goal but different initial conditions).
+Two evaluation setups are considered:
+ +within-template where the agent is trained on some tasks in a template and evaluated on a set of held-out tasks from the same template.
+cross-template where the agent is evaluated on tasks from a different template.
+In the training phase, the model has access to the simulator (but not to the correct solution). So the model could learn an action-prediction model or forward dynamics model or both.
+In the testing phase, the model can query the simulator only a few times. Each query provides it with the binary reward and the intermediate observations.
+The emphasis is on solving more tasks (in few queries) during the test phase.
+This requirement is captured using a metric called AUCCESS.
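+A sketch of how such a metric could be computed; the log-based weighting below (which emphasizes solving tasks in few attempts) is an assumption about the exact form:
+```python
+import numpy as np
+
+def auccess(attempts_to_solve, max_attempts=100):
+    # attempts_to_solve: per-task number of attempts needed (np.inf if unsolved)
+    ks = np.arange(1, max_attempts + 1)
+    weights = np.log(ks + 1) - np.log(ks)
+    success_at_k = np.array([(attempts_to_solve <= k).mean() for k in ks])
+    return (weights * success_at_k).sum() / weights.sum()
+```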
+In general, the tasks in PHYRE-2B are harder than tasks in PHYRE-B.
+Random Agent - Randomly samples actions
+Non-parametric agent (MEM) - generates R actions at random and uses the simulator to check how many tasks can be solved using these R random actions. During testing, the R actions are tried in decreasing order of the number of training tasks they solve.
+Non-parametric agent with online learning (MEM-O) - Variant of MEM where an online adaptation step is performed during test time (to update the rank of the actions).
+Deep Q Networks with an action encoder, observation encoder and fusion model (combine action and observation representation).
+DQN with online learning (DQN-O): Variant of DQN with online updates (during the test phase).
+Contextual bandits.
+Policy learning approaches like PPO and A2C.
+Both Contextual bandits and policy-based approaches show poor training stability.
+The best agent, DQN-O, reaches an AUCCESS of 56.2% on PHYRE-B and 39.26% on PHYRE-2B. In general, agents with online adaptation perform better.
+The tasks are designed such that 100000 attempts are sufficient to solve 100% of tasks in PHYRE-B and 95% of tasks in PHYRE-2B.
+Even though only two tiers are provided right now, the benchmark is readily extensible and new tasks can be added in the future.
+The paper proposes MAML++ - a modification of the MAML algorithm that stabilizes its training, improves generalization performance and reduces the computational overhead.
+Training the outer loop requires unfolding the inner loop multiple times.
+In absence of skip connections, the gradient is multiplied by the same parameter multiple times.
+The large effective depth and the absence of skip connections can lead to exploding and vanishing gradients.
+The paper proposes to stabilize gradient propagation by minimizing the target set loss computed by the base network after every inner-loop step on the support set (a multi-step loss).
+It is important to anneal the contribution of earlier steps and increase the contribution of later steps over time.
+While the first-order MAML is faster, the resulting model may not have as good a generalization error as the second-order MAML.
+The paper proposes derivative order annealing where first-order gradients are used for the first 50 epochs and second-order gradients are used thereafter.
+This derivative order annealing appears to be more stable than models that use second-order derivatives only.
+In MAML, the statistics of the current batch are used for normalization instead of accumulating the running statistics.
+The paper proposes to collect the statistics per step which can increase the convergence speed, stability, and generalization performance.
+In MAML, the batch normalization biases are not updated in the inner-loop which can adversely impact the performance.
+The paper proposes to learn a set of biases (per step) within the inner loop update.
+MAML uses a single learning rate across all the steps and all the parameters. This means a single learning rate hyperparameter must work well for all the layers and steps.
+An alternate solution would be to learn a separate learning rate per parameter but this can be impractical as it doubles the number of parameters to be learned.
+The paper proposes to learn a learning rate and direction for each layer in the network, for each step it takes in the inner loop.
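+A sketch of such an inner loop with learned per-layer, per-step learning rates (names are illustrative):
+```python
+import torch
+
+def inner_loop_lslr(params, loss_fn, lrs, num_steps):
+    # lrs[step][name] is a learned learning rate for that layer and inner-loop step
+    for step in range(num_steps):
+        loss = loss_fn(params)
+        grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
+        params = {name: p - lrs[step][name] * g
+                  for (name, p), g in zip(params.items(), grads)}
+    return params
+```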
+The paper also proposed to anneal the learning rate of the outer loop (using cosine annealing) as it helps to achieve better generalization.
+Using these modifications helps to outperform the MAML model on both Omniglot and MiniImagenet datasets.
+The biggest benefit comes by learning the per-layer, per-step learning rates and by using the per-step batch normalization.
+The paper considers the task of training an RL system by sampling data from multiple simulators (over parallel devices).
+The setup is that of distributed RL setting with n agents or actor-learners (composed of a single learner and several actors). These agents are trying to maximize a common value function.
+One (existing) approach is to perform on-policy updates with a shared policy. The policy could be updated in synchronous (does not scale well) or asynchronous manner (can be unstable due to stale gradients).
+Off-policy approaches allow for better computational efficiency but can be unstable during training.
+The paper proposes the Gossip-based Actor-Learner Architecture (GALA), which uses asynchronous communication (gossip) between the n agents to improve the training of deep RL models.
+These agents are expected to converge to the same policy.
+During training, the different agents are not required to share the same policy and it is sufficient that the agent’s policies remain $\epsilon$-close to each other. This relaxation allows the policies to be trained asynchronously.
+The GALA approach is combined with A2C agents, resulting in GALA-A2C agents. They have better computational efficiency and scalability (as compared to A2C) and perform similarly to A3C and Impala.
+Training alternates between one local policy-gradient (and TD update) and asynchronous gossip between agents.
+During the gossip step, the agents send their parameters to some of the other agents (referred to as the peers) and update their parameters based on the parameters received from the other agents (for which the given agent is a peer).
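+A minimal sketch of the gossip step (uniform mixing weights over the received parameters are an assumption):
+```python
+def gossip_step(local_params, peer_params_list):
+    # Mix local parameters with those received from in-peers
+    n = len(peer_params_list) + 1
+    return [(p + sum(peer[i] for peer in peer_params_list)) / n
+            for i, p in enumerate(local_params)]
+```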
+GALA agents are implemented using non-blocking communication so that they can operate asynchronously.
+The paper includes the proof that the policies learned by the different agents are within $\epsilon$ distance of each other (ie all the policies lie within an $\epsilon$-distance ball) thus ensuring that the policies do not diverge much from each other.
+Six games from the Atari 2600 games suite are used for the experiments.
+Baselines: A2C, A3C, Impala
+GALA agents are configured in a directed ring graph topology.
+With A2C, as the number of simulators increases, the number of convergent runs (runs with a threshold reward) decreases.
+Using gossip algorithms increases or maintains the number of convergent runs. It also improves the performance, sample efficiency and compute efficiency of A2C across all the six games.
+When compared to Impala and A3C, GALA-A2C generally outperforms (or performs as well as) those baselines.
+Given that the learned policies remain within an $\epsilon$ ball, the agent’s gradients are less correlated as compared to the A2C agents.
+The paper introduces Contrastively-trained Structured World Models (C-SWMs).
+These models use a contrastive approach for learning representations in environments with compositional structure.
+The training data is in the form of an experience buffer \(B = \{(s_t, a_t, s_{t+1})\}_{t=1}^T\) of state transition tuples.
+The goal is to learn:
+ +an encoder \(E\) that maps the observed states $s_t$ (pixel state observations) to latent state $z_t$.
+a transition model \(T\) that predicts the dynamics in the hidden state.
+The model defines the energy of a tuple \((s_t, a_t, s_{t+1})\) as \(H = d(z_t + T(z_t, a_t), z_{t+1})\).
+The model has an inductive bias for modeling the effect of action as translation in the abstract state space.
+An extra hinge-loss term is added: \(\max(0, \gamma - d(\tilde{z}_{t}, z_{t+1}))\) where \(\tilde{z}_{t} = E(\tilde{s}_{t})\) is a corrupted latent state corresponding to a randomly sampled state \(\tilde{s}_{t}\).
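+A sketch of the resulting contrastive objective (using squared Euclidean distance as the energy d, consistent with the translation-based transition model; names are illustrative):
+```python
+import torch
+import torch.nn.functional as F
+
+def cswm_loss(encoder, transition, s_t, a_t, s_next, s_corrupt, gamma=1.0):
+    z_t, z_next, z_corrupt = encoder(s_t), encoder(s_next), encoder(s_corrupt)
+    d = lambda a, b: ((a - b) ** 2).sum(dim=-1)        # energy as squared distance
+    positive = d(z_t + transition(z_t, a_t), z_next)   # H for the true transition
+    negative = F.relu(gamma - d(z_corrupt, z_next))    # hinge on the corrupted state
+    return (positive + negative).mean()
+```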
+The goal is to learn object-oriented representations where each state embedding is structured as a set of objects.
+Assuming the number of object slots to be \(K\), the latent space and the action space can be factored into \(K\) independent latent spaces (\(Z_1 \times ... \times Z_K\)) and action spaces (\(A_1 \times ... \times A_K\)) respectively.
+There are K CNN-based object extractors and an MLP-based object encoder.
+The actions are represented as one-hot vectors.
+A fully connected graph is induced over K objects (representations) and the transition function is modeled as a Graph Neural Network (GNN) over this graph.
+The transition function produces the change in the latent state representation of each object.
+The factorization can be taken into account in the loss function by summing over the loss corresponding to each object.
+Grid World Environments - 2D shapes, 3D blocks
+Atari games - Pong and Space Invaders
+3-body physics simulation
+Random policy is used to collect the training data.
+Evaluation is performed in the latent space (no reconstruction in the pixel space) using ranking metrics. The observations (to compare against) are randomly sampled from the buffer.
+Baselines - auto-encoder based World Models and Physics as Inverse Graphics model.
+In the grid-world environments, C-SWM models the latent dynamics almost perfectly.
+Removing either the state factorization or the GNN transition model hurts the performance.
+C-SWM performs well on Atari as well but the results tend to have high variance.
+The optimal values of $K$ should be obtained by hyperparameter tuning.
+For the 3-body physics tasks, both the baselines and proposed models work quite well.
+Interestingly, the paper has a section on limitations:
+ +The object extractor module can not disambiguate between multiple instances of the same object (in a scene).
+The current formulation of C-SWM can only be used with deterministic environments.
+The paper presents the MuZero algorithm that performs planning with a learned model.
+The algorithm achieves state of the art results on the Atari suite (where model-free approaches generally perform the best) and on planning-oriented games like Chess and Go (where planning approaches generally perform the best).
+Model-based approaches generally focus on reconstructing the true environment state or the sequence of full observations.
+MuZero focuses on predicting only those aspects that are most relevant for planning - policy, value functions, and rewards.
+The model consists of three components: (representation) encoder, dynamics function, and the prediction network.
+The learning agent has two kinds of interactions - real interactions (ie the actions that are actually executed in the real environment) and hypothetical or imaginary actions (ie the actions that are executed in the learned model or the dynamics function).
+At any timestep \(t\), the past observations \(o_1, \ldots, o_t\) are encoded into the state \(s_t\) using the encoder.
+Now the model takes hypothetical actions for the next K timesteps by unrolling the model for K steps.
+For each timestep \(k = 1, \ldots, K\), the dynamics model predicts the immediate reward \(r_k\) and a new hidden state \(h_k\) using the previous hidden state \(h_{k-1}\) and action \(a_k\).
+At the same time, the policy \(p_k\) and the value function \(v_k\) are computed using the prediction network.
+The initial hidden state \(h_0\) is initialized using the state \(s_t\).
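+A sketch of this K-step unroll (function names are illustrative):
+```python
+def muzero_unroll(encoder, dynamics, prediction, observations, actions, K=5):
+    h = encoder(observations)                  # h_0 initialized from the state s_t
+    outputs = []
+    for k in range(K):
+        policy_k, value_k = prediction(h)      # prediction network: p_k, v_k
+        reward_k, h = dynamics(h, actions[k])  # dynamics: r_k, h_k from h_{k-1}, a_k
+        outputs.append((policy_k, value_k, reward_k))
+    return outputs
+```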
+Any MDP Planning algorithm can be used to search for optimal policy and value function given the state transitions and the rewards induced by the dynamics function.
+Specifically, the MCTS (Monte Carlo Tree Search) algorithm is used and the action \(a_{t+1}\) (ie the action that is executed in the actual environment) is selected from the policy outputted by MCTS.
+At each timestep t, the MCTS algorithm is executed to choose the next action (which will be executed in the real environment).
+The resulting next observation \(o_{t+1}\) and reward \(r_{t+1}\) are stored and the trajectory is written to the replay buffer (at the end of the episode).
+For every hypothetical step k, match the predicted policy, value, and reward to the actual target values.
+The target policy is generated by the MCTS algorithm.
+The target value function and reward are generated by actually playing the game (or the MDP).
+MuZero leverages the search-based policy iteration from AlphaZero.
+It extends AlphaZero to setups with a single agent (where self-play is not possible) and setups with a non-zero reward at the intermediate time steps.
+The encoder and the prediction functions are similar to the ones used by AlphaZero.
+K is set to 5.
+Environments: 57 games in Atari along with Chess, Go and Shogi
+MuZero achieves the same level of performance as AlphaZero for Chess and Shogi. In Go, MuZero slightly outperforms AlphaZero despite doing fewer computations per node in the search tree.
+In Atari, MuZero achieves a new state-of-the-art compared to both model-based and model-free approaches.
+The paper considers a variant called MuZero Reanalyze that reanalyzes old trajectories by re-running the MCTS algorithm with the updated network parameter. The motivation is to have a better sample complexity.
+MuZero performs well even when using a single simulation of MCTS (during inference).
+During training, using more simulations of MCTS helps to achieve better performance, though even just 6 simulations per move is sufficient to learn a good model for Ms. Pacman.
+Procedural text comprehension tasks focus on modeling the effect of actions and predicting what happens next.
+But they do not consider why some actions need to happen before other actions.
+The paper proposes a new model called XPAD (eXPlainable Action Dependency) that considers the purpose of actions while predicting their effect.
+The model favors effects that:
+ +explain more of the actions in the text.
+are more plausible given the context.
+An existing procedural text benchmark dataset (Propara) is expanded by adding the task of explaining actions by predicting their dependencies.
+Input
+ +Procedural text: a chronologically ordered sequence of T sentences.
+List of N participant entities, whose state changes at some step.
+Output
+ +State change matrix $\pi(T \times N)$ with four possible states - move, create, destroy, none.
+This matrix tracks how each entity’s state changes after each step.
+Dependency Explanation Graph
+ +Identify what steps are necessary to execute a given step (say $s_i$) and represent this dependency in the form of a dependency explanation graph $G = \langle S, E \rangle$.
+In this graph, each node is a step and the direction of edge describes the order of dependency.
+Propara dataset is expanded to extract the dependency graph using both heuristic and automated methods.
+The automated method is based on the coherence assumption that if step $s_j$ changes the state of entity $e_k$, then $s_j$ is a precondition for the first subsequent step that changes the state of $e_k$.
+The model is based on the ProStruct system and uses an encoder-decoder based architecture.
+Encoder
+ +Input: Sentence $s_t$ and entity $e_j$.
+The sentence is encoded using GloVe vectors and a BiLSTM model, and the entity is encoded as an indicator variable.
+The combined representation is denoted as $c_{tj}$.
+This representation is passed through an MLP to generate k logits that encode the probability of each entity j undergoing a state change at step t.
+Decoder
+ +Beam search is performed to decode the encoder representation into the state change matrix and dependency graph using a score function that ensures global consistency.
+Score function has two components:
+ +State change score - depends on the likelihood that the selected state changes at step t given the text and the state change history from steps $s_1$ to $s_{t-1}$.
+Dependency graph score
+ +This is based on the connectivity and likelihood of the resulting dependency explanation graph.
+This score is used to bias the graph search towards:
+ +predictions that have an identifiable purpose ie checking if a particular state change prediction leads to a connection in the dependency explanation graph.
+graphs that are more likely according to the background knowledge to distinguish likely dependency links from the unlikely ones.
+During training, XPAD has access to the correct path (in the search space) and learns to minimize the joint loss corresponding to predicting the state change and the dependency explanation graph.
+During testing, XPAD performs beam search to predict the most likely state change and dependency explanation graph.
+Tasks:
+ +State change prediction
+Dependency explanation prediction
+Baselines:
+ +XPAD significantly outperforms all the baseline models on the dependency explanation task.
+Improvements on the state change prediction task are less significant.
+Removing dependency graph scores from XPAD leads to a drop in the F1 score.
+The paper provides an elaborate discussion on the different types of errors that the XPAD system makes.
+The paper proposes parameter-reduction techniques to lower the memory consumption (and improve training speed) of BERT.
+It also proposes to use a self-supervised loss (based on inter-sentence coherence) and argues that this loss is better than the NSP loss used by BERT.
+ALBERT architecture is similar to that of BERT with three major differences.
+Factorized Embedding Parameterization
+ +In BERT and followup works, the embedding size was tied to the size of the context vector.
+Since the context vector is expected to encode the entire context, it needs to have a large dimensionality.
+One consequence of this choice is that even the embedding layer (which encodes the representation for each token) has a large size. This increases the overall memory footprint of the model.
+The paper proposes to factorize the embedding parameters into two smaller matrices.
+The embedding layer learns a low dimensional representation of the tokens and this representation is projected into a high dimensional space.
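+A sketch of this factorization (sizes are illustrative): with an embedding size E much smaller than the hidden size H, the parameter count drops from V x H to V x E + E x H for a vocabulary of size V.
+```python
+import torch.nn as nn
+
+class FactorizedEmbedding(nn.Module):
+    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=4096):
+        super().__init__()
+        self.embed = nn.Embedding(vocab_size, embed_dim)  # low-dimensional embedding
+        self.project = nn.Linear(embed_dim, hidden_dim)   # projection to hidden space
+
+    def forward(self, token_ids):
+        return self.project(self.embed(token_ids))
+```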
+Cross-layer parameter sharing
+ +Inter-sentence coherence loss
+ +BERT uses two losses - Masked Language Modeling loss (MLM) and Next Sentence Prediction (NSP).
+In the NSP task, the model is provided a pair of sentences and it has to predict if the two sentences appear consecutively in the same document or not. Negative samples are created by sampling sentences from different documents.
+The paper argues that NSP is not effective as a loss function as it merges topic prediction and coherence prediction into one task (as the two sentences come from different documents). The topic prediction is an easier task as compared to coherence prediction.
+Hence the paper proposes to use the Sentence Order Prediction task where the model has to predict which of the two sentences comes first in a document. The negative samples are created by simply swapping the order in the positive samples. Hence both the sentences come from the same document and topic prediction alone can not be used to solve the task.
+Different variants (in terms of size) of ALBERT and BERT models are compared (eg ALBERT, ALBERT-x, BERT-x, etc).
+In general, ALBERT models have many-times fewer parameters as compared to the BERT models.
+Datasets - BookCorpus, English Wikipedia.
+ALBERT-xxlarge significantly outperforms the BERT-large model even though it has around 70% of the parameters of BERT-large.
+BERT-xlarge performs worse than BERT-base hinting that it is difficult to train such large models.
+ALBERT models also have better data throughput as compared to BERT models.
+For the ALBERT models, an embedding size of 128 performs the best.
+As the hidden dimension is increased, the model obtains better performance, but with diminishing returns.
+Very wide ALBERT models (say with a context size of 1024) do not benefit much from depth.
+Using additional training data boosts the performance for most of the downstream tasks.
+The paper empirically shows that using dropout could hurt the performance of the ALBERT models. This observation may not hold for BERT as it does not share parameters across layers and hence may need regularization via dropout.
+ALBERT also improves the state of the art performance on GLUE, SQuAD and RACE benchmarks, for both single-model and ensemble setup.
+The paper studies five different techniques for state abstraction in MDPs (Markov Decision Processes) and evaluates their usefulness for planning and learning.
+The general idea behind abstraction is to map the actual (or observed) state to an abstract state that should be more amenable for learning.
+It can be thought of as a mapping from one representation to another representation while preserving some useful properties.
+Consider an MDP \(M = \langle S, A, P, R, \gamma \rangle\) where \(S\) is the finite set of states, \(A\) is the finite set of actions, \(P\) is the transition function, \(R\) is the bounded reward function and \(\gamma\) is the discount factor.
+The abstract version of the MDP is \(\widetilde{M} = \langle \widetilde{S}, A, \widetilde{P}, \widetilde{R}, \gamma \rangle\) where \(\widetilde{S}\) is the finite set of abstract states, \(\widetilde{P}\) is the transition function in the abstract state space and \(\widetilde{R}\) is the bounded reward function in the abstract state space.
+Abstraction function \(\phi\) is a function that maps a given state \(s\) to its abstract counterpart \(\widetilde{s}\).
+The inverse image \(\phi^{-1}(\widetilde{s})\) is the set of ground states that map to the \(\widetilde{s}\) under the abstraction function \(\phi\).
+A weighing function \(w(s)\) is used to measure how much a state \(s\) contributes to the abstract state \(\phi(s)\).
+Given two abstraction functions \(\phi_{1}\) and \(\phi_{2}\), \(\phi_{1}\) is said to be finer than \(\phi_{2}\) iff for any states \(s_{1}, s_{2}\) if \(\phi_{1}(s_{1}) = \phi_{1}(s_{2})\) then \(\phi_{2}(s_{1}) = \phi_{2}(s_{2})\).
+This finer relation is reflexive, antisymmetric and transitive, and defines a partial ordering.
+While many abstractions are possible, not all abstractions are equally important.
+Model-irrelevance abstraction \(\phi_{model}\):
+ +If two states $s_{1}$ and $s_{2}$ have the same abstracted state, then their one-step model is preserved.
+ +Consider any action \(a\) and any abstract state \(\widetilde{s}\): if \(\phi_{model}(s_{1}) = \phi_{model}(s_{2})\) then \(R(s_1, a) = R(s_2, a)\) and \(\sum_{s' \in \phi_{model}^{-1}(\widetilde{s})}P_{s_1, s'}^{a} = \sum_{s' \in \phi_{model}^{-1}(\widetilde{s})}P_{s_2, s'}^{a}\).
+\(Q^{\pi}\)-irrelevance abstraction:
+ +It preserves the state-action value function for all the states.
+ +\(\phi_{Q^{\pi}}(s_1) = \phi_{Q^{\pi}}(s_2)\) implies \(Q^{\pi}(s_1, a) = Q^{\pi}(s_2, a)\).
+\(Q^{*}\)-irrelevance abstraction:
+ +\(a^{*}\)-irrelevance abstraction:
+ +\(\phi_{\pi^{*}}\)-irrelevance abstraction:
+In terms of fineness, \(\phi_0 \geq \phi_{model} \geq \phi_{Q^{\pi}} \geq \phi_{Q^*} \geq \phi_{a^*} \geq \phi_{\pi^*}\). Here \(\phi_0\) is the identity mapping, ie \(\phi_0(s) = s\).
+If a property applies to any abstraction, it also applies to all the finer abstractions.
+As we go from finer to coarser abstractions, the information loss increases (ie fewer components can be recovered) while the state-space reduces (ie the efficiency of solving the problem increases). This leads to a tradeoff when selecting abstractions.
+For example, with abstractions \(\phi_{model}, \phi_{Q^{\pi}}, \phi_{Q^*}, \phi_{a^*}\), the optimal abstract policy \(\widetilde{\pi}^{*}\) is optimal in the ground MDP.
+Similarly, if each state-action pair is visited infinitely often and the step-size decays properly, Q-learning with \(\phi_{model}, \phi_{Q^{\pi}}, \phi_{Q^*}\) converges to the optimal state-action value functions in the MDP. More conditions are needed for convergence in the case of the remaining two abstractions.
+For \(\phi_{model}, \phi_{Q^{\pi}}, \phi_{Q^*}, \phi_{a^*}\), the model built with the experience converges to the true abstract model with infinite experience if the weighing function \(w(s)\) is fixed.
+The paper proposes a technique (called Parameter Superposition or PSP) for training and storing multiple models within a single set (or instance) of parameters.
+The different models exist in “superposition” and can be retrieved dynamically given task-specific context information.
+Consider a task with input \(x \in R^N\) and parameters \(W \in R^{M \times N}\) where the output (targets or features) is given as \(y = Wx\).
+Now consider \(K\) such tasks with parameters \(W_1, W_2, \cdots W_K\).
+If each \(W_k\) requires only a small subspace in \(R^N\), then a linear transformation \(C_k^{-1}\) can be used such that each \(W_kC_k^{-1}\) occupies a mutually orthogonal subspace in \(R^N\).
+The set of parameters \(W_1, \cdots W_K\) can be represented by a single matrix \(W \in R^{M \times N}\) obtained by summing the \(W_kC_k^{-1}\) terms.
+The parameters corresponding to the \(k^{th}\) task can be retrieved (with some noise) using the context \(C_k\) as \(\widetilde{W}_k = WC_k\).
+Even though the retrieval is noisy, the effect of noise is limited for the context vectors used in the paper.
+Finally, \(\widetilde{y} = \widetilde{W}_{k}x = (WC_{k})x = W(C_{k}x)\).
+Instead of learning \(K\) separate models, only \(K\) context vectors (along with 1 superimposed model) need to be learned.
+The key assumption is that \(N\) (in \(x \in R^N\)) is large enough such that each \(W_k\) requires only a small subspace of \(R^N\).
+Since images and speech signals tend to occupy a low dimensional manifold, this requirement can be satisfied by over-parameterizing x.
+Rotational Superposition (pspRotation)
+ +Sample rotations uniformly from the orthogonal group \(O(M)\).
+ +The downside is that if \(M \sim N\), it requires storing as many parameters as learning \(K\) individual models (since each \(C\) is of size \(M \times M\)).
+Complex Superposition (pspComplex)
+ +The design of rotational superposition can be improved by choosing \(C_k\) to be a diagonal matrix ie \(C_k = diag(c_k)\) where \(c_k\) is a vector of size \(M\).
+Choosing \(c_k\) to be a vector of complex numbers (of the form \(c_{k}^{j} = e^{i\phi_{j}(k)}\) where \(\phi_{j}(k)\), the phase, is sampled uniformly from \([-\pi, \pi]\)) leads to \(C_k\) being a diagonal orthogonal matrix.
+Powers of a single context
+ +Binary Superposition (pspBinary)
+ +The parameter superposition principle can be applied to all the linear layers of a network.
+For the convolutional layers, it makes more sense to apply superposition to the convolutional kernel and not to the input image (as the dimensionality of convolutional parameters is smaller than that of inputs).
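+A sketch of binary superposition applied to a linear layer (a minimal illustration; the actual models also handle convolutions and other context choices):
+```python
+import torch
+import torch.nn as nn
+
+class PSPBinaryLinear(nn.Module):
+    # One superimposed weight matrix shared by all tasks; each task gets a fixed
+    # random +/-1 context vector (pspBinary)
+    def __init__(self, in_dim, out_dim, num_tasks):
+        super().__init__()
+        self.linear = nn.Linear(in_dim, out_dim)
+        contexts = torch.randint(0, 2, (num_tasks, in_dim)).float() * 2 - 1
+        self.register_buffer("contexts", contexts)
+
+    def forward(self, x, task_id):
+        # y = W(c_k * x): the context maps the input into a task-specific subspace
+        return self.linear(x * self.contexts[task_id])
+```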
+For all the experiments, the baseline is a standard supervised learning setup, unless mentioned otherwise.
+The metric is the performance on the previous tasks when the model has been trained on the newer tasks.
+Input Interference
+ +The input distribution changes over time.
+Permuted MNIST dataset is used where each permutation of the pixels corresponds to a new task.
+A new task is sampled every 1000 mini-batches.
+As the network size increases, Parameter Superposition (psp) outperforms the baseline significantly.
+pspRotation > pspComplex > pspBinary in terms of both performance and the number of additional parameters required for each new task.
+Given that pspBinary is the easiest to implement while being comparable to more sophisticated baselines like Elastic Weight Consolidation (EWC) and Synaptic Intelligence, the paper presents most of the results with the pspBinary model.
+Continuous Domain Shift
+ +Rotating-MNIST and Rotating-FashionMNIST tasks are proposed to simulate continuous domain shift.
+In these tasks, the input images are rotated in-plane by a small angle such that the rotation is complete after 1000 steps.
+A new context is assigned every 100 steps, since the per-step changes in the angle are very small.
+The 10 context vectors used in the first 1000 steps are reused for the subsequent steps.
+Randomly changing the context vector
+ +The paper considers an ablation where the context vector is randomly changed at every step (of the 1000 step cycle). This requires the superposition model to store 1000 models.
+This approach is better than the supervised learning baseline but not as good as the proposed psp* models.
+Output Interference
+ +This is the setup where the model transitions from one classification task to another.
+Incremental CIFAR dataset is used with Resnet18 as the base model.
+Baseline is a standard supervised learning model where a new classification head is used for each task (since the classes have a different meaning in each dataset). The model component before the classification layer is shared across the tasks.
+Even though the labels are different across the datasets, the pspBinary model, trained with a single output layer, outperforms the multi-headed baseline.
+Training models with large minibatches (using distributed synchronous SGD) can lead to optimization issues.
+The paper presents techniques for training models with large batch size while matching the accuracy of small minibatch setups.
+The paper focuses on the ImageNet dataset, but many of the proposed ideas are applicable broadly.
+When the minibatch size increases by a factor of k, the learning rate should also be increased by a factor of k (while keeping all other hyperparameters like weight decay fixed).
+Note that this is an empirical rule and is not expected to hold under all conditions.
+One such condition is when the model is changing rapidly during the first few epochs. In this case, a warmup phase is introduced to stabilize the model.
+The paper verifies that the scaling rule is applicable to batch sizes as large as 8K.
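+A sketch of the linear scaling rule combined with gradual warmup (base_lr=0.1 and k=32 mirror the 0.1 to 3.2 warmup described below; treat the values as illustrative):
+```python
+def learning_rate(step, steps_per_epoch, base_lr=0.1, k=32, warmup_epochs=5):
+    # Linear scaling rule: target lr = k * base_lr for a k-times larger minibatch
+    target_lr = k * base_lr
+    warmup_steps = warmup_epochs * steps_per_epoch
+    if step < warmup_steps:
+        # Linearly ramp from base_lr to target_lr over the warmup phase
+        return base_lr + (target_lr - base_lr) * step / warmup_steps
+    return target_lr
+```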
+Batch normalization uses batch statistics to normalize the data. Hence, the loss corresponding to each data point (in the batch) is not independent. Thus, changing the batch size could change the underlying function being optimized.
+In the distributed SGD setup, the per-GPU (or per-worker) batch size should be kept constant, and only one worker should compute the batch norm statistics.
+When using weight decay, scaling the cross-entropy loss is not the same as scaling the learning rate.
+When using momentum, changing the learning rate could require “momentum correction.”
+Ensure that the per-worker loss is normalized by the size of the total minibatch and not just by the size of minibatch that each worker sees.
+For each epoch, use a single random shuffling of the training data (before dividing it between the workers).
+The paper describes various techniques to speed up the training pipeline by reducing the communication overhead between nodes. (Each node can have one or more GPUs).
+First, a node sums the gradient from all the GPUs it has.
+The gradients are shared and summed across all the nodes.
+Each node broadcasts the resulting gradient to all the GPUs it has.
+Gradient aggregation is performed in parallel with backpropagation: while aggregating the gradients for one layer, the system starts computing the gradients of the next layer.
+Using these approaches, a Resnet50 model can be trained on the ImageNet dataset in an hour (using 256 workers).
+When an appropriate warmup strategy is used, the training and the validation curves (for the large batch size setup) match the corresponding curves for the small batch size setup.
+The best performing warmup strategy is the one where training starts at a learning rate of 0.1 and linearly increases to 3.2 over five epochs.
+The paper shows that the results are not specific to the Resnet50 model (experiments with Resnet101 model) or the use case (experiments with object detection and instance segmentation using Mask R-CNN).
+Along with providing the empirical validation of the proposed ideas, the paper describes all the hyperparameters. It also includes the training and validation curves with the different configurations which enable others to replicate and build on this work.
+The paper investigates two possible reasons behind the usefulness of the MAML algorithm:
+ +Rapid Learning - Does MAML learn features that are amenable for rapid learning?
+Feature Reuse - Does the MAML initialization provide high-quality features that are useful for unseen tasks?
+This leads to a follow-up question: how much task-specific inner loop adaptation is needed?
+In a standard few-shot learning setup, the different datasets have different classes. Hence, the top-most layer (or the head) of the learning model should be different for different tasks.
+The subsequent discussion only applies to the body of the network (ie, network minus the head).
+Freezing Layer Representations
+ +In this setup, a subset (or all) of the parameters are frozen (after MAML training) and are not adapted at test time.
+Even when the entire network is frozen, the performance drops only marginally.
+This indicates that the representation learned by the meta-initialization is good enough to be useful on the test tasks (without requiring any adaptation step).
+Note that the head of the network is still adapted during testing.
+Representational Similarity
+ +In this setup, the paper measures how much the latent representations (learned by the network) of a fully trained model change during the inner loop update.
+Canonical Correlation Analysis (CCA) and Central Kernel Alignment (CKA) metrics are used to measure the similarity between the representations.
+The main finding is that the representations in the body of the network are very similar before and after the inner loop updates while the representations in the head of the network are very different.
+The above two observations indicate that feature reuse is the primary driving factor for the success of MAML.
+When does feature reuse happen
+ +The paper considers the model at different stages of training and compares the similarity in the representation (before and after the inner loop update).
+Even early in training, the CCA similarity between the representations (before and after the inner loop update) is quite high. Similarly, freezing the layers (for the test time update), early in training, does not degrade the test time performance much. This hints that the feature reuse happens early in the learning process.
+The empirical evidence suggests that the success of MAML lies in the feature reuse.
+The authors build on this observation and propose a simplification of the MAML algorithm: ANIL or Almost No Inner Loop Algorithm
+In this algorithm, the inner loop updates are applied only to the head of the network.
+Despite being much more straightforward, the performance of ANIL is close to the performance of MAML for both few-shot image classification and RL tasks.
+Removing most of the inner loop parameters speeds up the computation by a factor of 1.7 (during training) and 4.1 (during inference).
+Given that it is possible to remove most of the parameters from the inner loop update (without affecting the performance), the next step is to check if the inner loop update can be removed entirely.
+This leads to the NIL (No Inner Loop) algorithm, which does not involve any inner loop adaptation steps.
+A few-shot learning model is trained - either with MAML or ANIL.
+During testing, the head is removed.
+For each task, the K training examples are fed to the body to obtain class representations.
+For a given test data point, the representation of the data point is compared with the different class representations to obtain the target class.
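+A sketch of this NIL-style prediction step (cosine similarity between query features and class representations; treat the details as illustrative):
+```python
+import torch
+import torch.nn.functional as F
+
+def nil_predict(body, support_x, support_y, query_x, num_classes):
+    # No inner loop: build class representations from support features and
+    # classify queries by similarity
+    support_feats = body(support_x)                        # (num_support, dim)
+    class_reps = torch.stack([support_feats[support_y == c].mean(dim=0)
+                              for c in range(num_classes)])
+    sims = F.cosine_similarity(body(query_x).unsqueeze(1),
+                               class_reps.unsqueeze(0), dim=-1)
+    return sims.argmax(dim=-1)                             # (num_query,)
+```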
+The NIL algorithm performs similar to the MAML and the ANIL algorithms for the few-shot image classification task.
+Note that it is still important to use MAML/ANIL during training, even though the learned head is not used during evaluation.
+The paper studies observational overfitting: The phenomenon where an agent overfits to different observation spaces even though the underlying MDP remains fixed.
+Unlike other works, the “background information” (in the pixel space) is correlated with the progress of the agent (and is not just noise).
+Base MDP $M = (S, A, R, T)$ where $S$ is the state space, $A$ is the action space, $R$ is the reward function, and $T$ is the transition dynamics.
+$M$ is parameterized using $\theta$. In practice, this means introducing an observation function $\phi_{\theta}$, i.e., $M_{\theta} = (M, \phi_{\theta})$.
+A distribution over $\theta$ defines a distribution over the MDPs.
+The learning agent has access to the pixel space observations and not the state space observations.
+Generalization gap is defined as $J_{\theta}(\pi) - J_{\theta^{train}}(\pi)$ where $\pi$ is the learning agent, $\theta$ is the distribution over all the observation functions, $\theta^{train}$ is the distribution over the observation functions corresponding to the training environments. $J_{\theta}(\pi)$ is the average reward that the agent obtains over environments sampled from $M_{\theta}$.
+$\phi_{\theta}$ considers two features - generalizable (invariant across $\theta$) and non-generalizable (depends on $\theta$), i.e., $\phi_{\theta}(s) = concat(f(s), g_{\theta}(s))$ where $f$ is the invariant function and $g_{\theta}$ is the non-generalizable function.
+The problem is set up such that “explicit regularization” can easily solve it. The focus is on understanding the effect of “implicit regularization”.
+LQR is used as a proxy for deep RL architectures given its advantages like enabling exact gradient descent.
+The functions are parameterized as follows:
+ +$f(s) = W_c s$
+$g_{\theta}(s) = W_{\theta} s$
+Observation at time $t$, $o_t$, is given as $o_t = [W_c; W_{\theta}] s_t$.
+Action at time $t$ is given as $a_t = K o_{t}$ where $K$ is the policy matrix.
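+A small numpy illustration of this observation model (all dimensions and the random matrices are illustrative):
+```python
+import numpy as np
+
+d_state, d_signal, d_noise = 10, 20, 100     # illustrative dimensions
+W_c = np.random.randn(d_signal, d_state)     # shared across environments
+W_theta = np.random.randn(d_noise, d_state)  # re-sampled per environment theta
+
+s_t = np.random.randn(d_state)
+o_t = np.concatenate([W_c @ s_t, W_theta @ s_t])   # concat(f(s), g_theta(s))
+
+K = np.random.randn(d_state, d_signal + d_noise)   # policy matrix (learned)
+a_t = K @ o_t                                      # a_t = K o_t
+```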
+Dimensionality:
+ +In case of training on just one environment, multiple solutions exist, and overfitting happens.
+Increasing $d_{noise}$ increases the generalization gap.
+Overparameterizing the network decreases the generalization gap and also reduces the norm of the policy.
+The base MDP is the Gym Environment.
+$M_{\theta}$ is generated as before.
+Increasing both width and depth for basic MLPs improves generalization.
+Generalization also depends on the choice of activation function, residual layers, etc.
+In the Gym environment, the actual state is projected to a larger vector and reshaped into an 84x84 tensor (image).
+The image from $f$ is concatenated with the image from $g$. This setup is referred to as the Gym-Deconv.
+The relative order of performance between NatureCNN, IMPALA, and IMPALA-Large (on both CoinRun and Gym-Deconv) is the same as the order of the number of parameters they contain.
+In an ablation, the policy is given access to only $g_{\theta}(s)$, which makes it impossible for the model to generalize. In this test of memorization capacity, implicit regularization seems to reduce the memorization effect.
+The pixel space observation in CoinRun is downsized from 64x64 to 32x32 and flattened into a vector.
+In CoinRun, the dynamics change per level, and the noisy “irrelevant” features change location across the 1D input, making this setup more challenging than the previous ones.
+Overparameterization improves generalization in this scenario as well.
+The paper proposes to build a universal neural machine translation system that can translate between any pair of languages.
+As a concrete instance, the paper prototypes a system that handles 103 languages (25 Billion translation pairs).
+Hypothesis: The learning signal from one language should benefit the quality of other languages.
+This positive transfer is evident for low resource languages but tends to hurt the performance for high resource languages.
+In practice, adding new languages reduces the effective per-task capacity of the model.
+Maximize the number of languages within one model.
+Maximize the positive transfer to low resource languages.
+Minimize the negative interference to high resource languages.
+Perform well in realistic, multi-domain settings.
+In-house corpus generated by crawling and extracting parallel sentences from the web.
+102 languages, with 25 billion sentence pairs.
+Compared with the existing datasets, this dataset is much larger, spans more domains, has a good variation in the amount of data available for different language pairs, and is noisier. These factors bring additional challenges to the universal NMT setup.
+Dedicated Bilingual models (variants of Transformers).
+Most bilingual experiments used Transformer Big and a shared source-target sentence-piece model (SPM).
+For medium and low resource languages, the Transformer Base was also considered.
+A batch size of 1M tokens is used. Increasing the batch size improves model quality and speeds up convergence.
+The paper compares the following two setups with the baseline:
+ +Combine all the datasets and train over them as if it is a single dataset.
+Combine all the datasets but upsample low resource languages so that all the languages are equally likely to appear in the combined dataset.
+A target “index” is prepended to every input sentence to indicate which language it should be translated into.
+Shared encoder and decoder are used across all the language pairs.
+The two setups use a batch size of 4M tokens.
+When all the languages are equally sampled, the performance on the low resource languages increases, at the cost of performance on high resource languages.
+Training over all the data at once reverses this trend.
+Temperature based sampling strategy is used to control the ratio of samples from different language pairs.
+A balanced sampling strategy improves the performance for the high resource languages (though not as good as the multilingual baselines) while retaining the high transfer performance on the low resource languages.
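+A sketch of the temperature-based sampling probabilities (the $p_i \propto (n_i / \sum_j n_j)^{1/T}$ form is the commonly used scheme; the value $T = 5$ is illustrative):
+```python
+import numpy as np
+
+def sampling_probs(pair_sizes, T=5.0):
+    # p_i proportional to (n_i / sum_j n_j)^(1/T): T = 1 recovers sampling
+    # proportional to data size; larger T approaches uniform sampling.
+    p = np.asarray(pair_sizes, dtype=np.float64)
+    p = (p / p.sum()) ** (1.0 / T)
+    return p / p.sum()
+
+# e.g., one high-resource and two low-resource language pairs
+print(sampling_probs([1e9, 1e6, 1e5]))
+```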
+Another reason behind the lagging performance (as compared to bilingual baselines) is the capacity of the multilingual models.
+Some open problems to consider:
+ +Task Scheduling - How to decide the order in which different language pairs should be trained.
+Optimization for multitask learning - How to design optimizer, loss functions, etc. that can exploit task similarity.
+Understanding Transfer:
+ +For the low resource languages, translating from multiple languages to English leads to better performance than translating from English to multiple languages.
+This can be explained as follows: In the first case (many-to-one), the setup is that of a multi-domain model (each source language is a domain). In the second case (one-to-many), the setup is that of multitasking.
+NMT models seem to be more amenable to transfer across multiple domains than transfer across tasks (since the decoder distribution does not change much).
+In terms of zero-shot performance, the performance for most language pairs increases as the number of languages changes from 10 to 102.
+Sentence Piece Model (SPM) is used.
+Temperature sampling is used to sample vocabulary from different languages.
+Using a smaller vocabulary (and hence smaller sub-word tokens) performs better for low resource languages, probably due to improved generalization.
+Low and medium resource languages tend to perform better with higher temperatures.
+The paper proposes a framework for joint modeling of labels and data by interpreting a discriminative classifier $p(y|x)$ as an energy-based model $p(x, y)$.
+Joint modeling provides benefits like improved calibration (i.e., the predictive confidence aligns with the misclassification rate), robustness, and out-of-distribution detection.
+Consider a standard classifier $f_{\theta}(x)$ which produces a k-dimensional vector of logits.
+$p_{\theta}(y | x) = softmax(f_{\theta}(x)[y])$
+Using concepts from energy-based models, we can write $p_{\theta}(x, y) = \frac{exp(-E_{\theta}(x, y))}{Z_{\theta}}$ where $E_{\theta}(x, y) = -f_{\theta}(x)[y]$
+$p_{\theta}(x) = \sum_{y}{ \frac{exp(-E_{\theta}(x, y))}{Z_{\theta}}}$
+$E_{\theta}(x) = -LogSumExp_y(f_{\theta}(x)[y])$
+Note that in the standard discriminative setup, shifting the logits $f_{\theta}(x)$ (by a constant) does not affect $p_{\theta}(y | x)$, but it affects $p_{\theta}(x)$.
+Computing $p_{\theta}(y | x)$ using $p_{\theta}(x, y)$ and $p_{\theta}(x)$ gives back the same softmax parameterization as before.
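+These quantities are straightforward to compute from the logits; a sketch:
+```python
+import torch
+
+def jem_quantities(f, x):
+    # The same K logits define both the classifier p(y|x) and the
+    # unnormalized density over x (up to the intractable log Z_theta).
+    logits = f(x)                                  # (batch, K)
+    log_p_y_given_x = logits.log_softmax(dim=-1)   # standard softmax classifier
+    energy_x = -torch.logsumexp(logits, dim=-1)    # E(x); log p(x) = -E(x) - log Z
+    return log_p_y_given_x, energy_x
+```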
+This reinterpreted classifier is referred to as a Joint Energy-based Model (JEM).
+The log-likelihood of the data can be factorized as $log p_{\theta}(x, y) = log p_{\theta}(x) + log p_{\theta}(y | x)$.
+The second factor can be trained using the standard CE loss. In contrast, the first factor can be trained using a sampler based on Stochastic Gradient Langevin Dynamics.
+Datasets: CIFAR10, CIFAR100, SVHN.
+Metrics: Inception Score, Frechet Inception Distance
+JEM outperforms generative, discriminative, and hybrid models on both generative and discriminative tasks.
+A calibrated classifier is the one where the predictive confidence aligns with the misclassification rate.
+Dataset: CIFAR100
+JEM improves calibration while retaining high accuracy.
+One way to detect OOD samples is to learn a density model that assigns a higher likelihood to in-distribution examples and lower likelihood to out of distribution examples.
+JEM consistently assigns a higher likelihood to in-distribution examples.
+The paper also proposes an alternate metric called approximate mass to detect OOD examples.
+The intuition is that a point could have high likelihood but be impossible to sample because its surroundings have a very low density.
+On the other hand, the in-distribution data points would lie in a region of high probability mass.
+Hence the norm of the gradient of log density could provide a useful signal to detect OOD examples.
+Use of replay buffer (and rehearsal) is a common technique for mitigating catastrophic forgetting.
+The paper builds on this idea but focuses on the sample selection aspect, i.e., which data points to store in the replay buffer.
+It formulates sample selection as a constraint minimization problem and shows that the proposed formulation is equivalent to maximizing the diversity of the samples with respect to the parameter gradients.
+Supervised learning tasks
+Online stream of data (i.e., one or few datapoints accessed at a time).
+When considering the $t^{th}$ task, the objective is: minimize the loss on the current task without increasing the loss on any of the previous tasks.
+The above constraint can be rephrased as $dot(g_t, g_i) \gt 0 \forall i \in [0, t-1]$ where $g_t$ is the gradient for the $t^{th}$ task.
+This is equivalent to saying that the current task gradient should not interfere negatively with the previous task gradient.
+In practice, the gradient constraint is enforced only over the examples in the minibatch (and not the full dataset).
+The paper interprets the constraint satisfaction problem as approximating an optimal feasible region (in the gradient space) where current task performance can be improved without hurting the performance on the previous tasks.
+The approximate region (of the shape of a polyhedral convex cone) is determined using only the examples from the replay buffer. Hence, the optimal region (defined for the entire dataset) would be contained within the approximate region.
+The size of the approximate region can be measured in terms of the solid angle defined by the intersection between the approximate region and a unit sphere.
+The paper argues that the approximate region can be made smaller by reducing the angle between each pair of gradients.
+The set of points satisfying the constraint can be computed using Integer Quadratic Programming (IQP).
+Given that the problem setup is online learning, using IQP for every new data point is not feasible.
+An inexact, greedy alternative is suggested where a score is maintained for each example in the buffer.
+When a new datapoint comes in, its score is computed and used to decide whether an existing datapoint in the buffer should be replaced.
+The score is the maximal cosine similarity of the current example with a random sample in the buffer.
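+A sketch of this scoring rule, assuming flattened per-example gradients are available (the number of sampled buffer entries is illustrative):
+```python
+import torch
+import torch.nn.functional as F
+
+def candidate_score(g_new, buffer_grads, n_samples=10):
+    # g_new: (d,) flattened gradient of the incoming example;
+    # buffer_grads: (m, d) gradients of buffered examples.
+    idx = torch.randint(len(buffer_grads), (n_samples,))
+    sims = F.cosine_similarity(g_new.unsqueeze(0), buffer_grads[idx], dim=-1)
+    return sims.max().item()   # lower score = more diverse gradient direction
+```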
+Benchmarks
+ +Disjoint MNIST
+Permuted MNIST
+Disjoint CIFAR10
+Shared head setup
+Baselines for sample selection
+ +Randomly select examples to keep in the buffer.
+Perform clustering - either in the feature space or in the gradient space.
+Use IQP to select the examples. This approach is not used for CIFAR10, as it is computationally costly.
+It would be interesting if the paper had considered baselines like selecting samples which had the largest loss.
+The proposed greedy approach outperforms the other methods.
+In an ablation experiment, the paper shows that the proposed approach works better than reservoir sampling (when the underlying data distribution is imbalanced).
+Another experiment compares the proposed approach with Gradient Episodic Memory and iCaRL. For Permuted and Disjoint MNIST, the different methods perform quite similarly, though the proposed approach performs better on Disjoint CIFAR10.
+Masked Language Modeling (MLM) is a common technique for pre-training language-based models. The idea is to “corrupt” some tokens in the input text (around 15%) by replacing them with the [MASK] token and then training the network to reconstruct (or predict) the corrupted tokens.
+Since the network learns from only about 15% of the tokens, the computational cost of training using MLM can be quite high.
+The paper proposes to use a “replaced token detection” task where some tokens in the input text are replaced by other plausible tokens.
+For each token in the modified text, the network has to predict if the token has been replaced or not.
+The alternative token is generated using a small generator network.
+Unlike the previous MLM setup, the proposed task is defined for all the input tokens, thus utilizing the training data more efficiently.
+The proposed approach is called ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately)
+Two neural networks - Generator (G) and Discriminator (D) are trained.
+Each network has a Transformer-based text encoder that maps a sequence of words into a sequence of vectors.
+Given an input sequence x (of length N), k indices are chosen for replacing the tokens.
+For each index, the generator produces a distribution over tokens. A token is sampled to replace in the original sequence. The resulting sequence is referred to as the corrupted sequence.
+Given the corrupted sequence, the Discriminator predicts which token comes from the data distribution and which comes from the generator.
+The generator is trained using the MLM setup, and the Discriminator is trained using the discriminative loss.
+After pre-training, only the Discriminator is finetuned on the downstream tasks.
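+A rough sketch of the joint training step (module names and the boolean `mask_positions` layout are assumptions; the loss weight of 50 follows the paper):
+```python
+import torch
+import torch.nn.functional as F
+
+def electra_step(generator, discriminator, tokens, mask_positions, mask_id):
+    # tokens: (batch, seq) token ids; mask_positions: boolean (batch, seq).
+    masked = tokens.clone()
+    masked[mask_positions] = mask_id
+
+    # Generator: standard MLM loss on the masked-out positions.
+    gen_logits = generator(masked)                       # (batch, seq, vocab)
+    mlm_loss = F.cross_entropy(gen_logits[mask_positions], tokens[mask_positions])
+
+    # Sample plausible replacements to build the corrupted sequence.
+    with torch.no_grad():
+        sampled = torch.distributions.Categorical(
+            logits=gen_logits[mask_positions]).sample()
+    corrupted = tokens.clone()
+    corrupted[mask_positions] = sampled
+
+    # Discriminator: per-token binary prediction - replaced or original?
+    is_replaced = (corrupted != tokens).float()
+    disc_logits = discriminator(corrupted)               # (batch, seq)
+    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)
+
+    return mlm_loss + 50.0 * disc_loss                   # lambda = 50 in the paper
+```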
+Datasets
+ +GLUE Benchmark
+Stanford QA dataset
+Architecture Choices
+ +Sharing word embeddings between generator and Discriminator helps.
+Tying all the encoder weights leads to marginal improvement but forces the generator and the Discriminator to be of the same size. Hence only embeddings are shared.
+Generator model is kept smaller than the discriminator model as a strong generator can make the training difficult for the Discriminator.
+A two-stage training procedure was explored where only the generator is trained for n steps. Then the weights of the generator are used to initialize the Discriminator. The Discriminator is then trained for n steps while keeping the generator fixed.
+This two-stage setup provides a nice curriculum for the Discriminator but does not outperform the joint training based setup.
+An adversarial loss based setup is also explored but it does not work well probably because of the following reasons:
+ +An adversarially trained generator is not as good as the MLM generator.
+An adversarially trained generator produces a low entropy output distribution.
+Results
+Ablations
+ +ELECTRA-15 is a variant of ELECTRA where the Discriminator is trained on only 15% of the tokens (similar to the MLM setup). This reduces performance significantly.
+Replace MLM setup
+ +Perform MLM training, but instead of using [MASK], use a token sampled from the generator.
+This improves the performance marginally.
+All-token MLM
+ +In the MLM setup, replace the [MASK] token by the sampled tokens and train the MLM model to generate all the words.
+In practice, the MLM model can either generate a word or copy the existing word.
+This approach closes much of the gap between BERT and ELECTRA.
+Interestingly, ELECTRA outperforms All-token MLM BERT, suggesting that ELECTRA may be benefiting from parameter efficiency since it does not have to learn a distribution over all the words.
+The paper proposes a simple and dataset-agnostic data augmentation mechanism called mixup.
+Consider two training examples, $(x_1, y_1)$ and $(x_2, y_2)$, where $x_1$ and $x_2$ are the datapoints and $y_1$ and $y_2$ are the labels.
+New training examples of the form $(\lambda \times x_1 + (1-\lambda) \times x_2, \lambda \times y_1 + (1-\lambda) \times y_2)$ are constructed by considering the linear interpolation of the datapoints and the labels. Here $\lambda \in [0, 1]$.
+$\lambda$ is sampled from a Beta distribution $Beta(\alpha, \alpha)$ where $\alpha \in (0, \infty)$.
+Setting $\lambda$ to 0 or 1 eliminates the effect of mixup.
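+A minimal sketch of the construction (using the same-batch shuffling noted later in these notes; `y_onehot` assumes one-hot labels, and $\alpha = 0.2$ is illustrative):
+```python
+import numpy as np
+import torch
+
+def mixup_batch(x, y_onehot, alpha=0.2):
+    # Mix each example with a shuffled partner from the same batch.
+    lam = np.random.beta(alpha, alpha)
+    perm = torch.randperm(x.size(0))
+    mixed_x = lam * x + (1 - lam) * x[perm]
+    mixed_y = lam * y_onehot + (1 - lam) * y_onehot[perm]
+    return mixed_x, mixed_y
+```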
+Mixup encourages the neural network to favor linear behavior between the training examples.
+Supervised Learning
+ +ImageNet for ResNet-50, ResNet-101 and ResNext-101.
+CIFAR10/CIFAR100 for PreAct ResNet-18, WideResNet-28-10 and DenseNet.
+Google command dataset for LeNet and VGG.
+In all these setups, adding mixup improves the performance of the model.
+Mixup makes the model more robust to noisy labels. Moreover, mixup + dropout improves over mixup alone. This hints that mixup’s benefits are complementary to those of dropout.
+Mixup makes the network more robust to adversarial examples in both white-box and black-box settings (ImageNet + Resnet101).
+Mixup also stabilizes the training of GANs by acting as a regularizer for the gradient of the discriminator.
+Convex combination of three or more examples (with weights sampled from a Dirichlet distribution) does not provide gains over the case of two examples.
+In the authors’ implementation, mixup is applied between images of the same batch (after shuffling).
+Interpolating only between inputs, with the same labels, did not lead to the same kind of gains as mixup.
+The paper is among the first to study image classification at a large scale (10000 classes and 9 million examples).
+This is a relatively old paper (2010), so some of the findings may not be relevant anymore; for instance, many of the scaling challenges have since been overcome. Moreover, the paper uses approaches like SVM and KNN (popular at that time) and does not use CNNs.
+Other observations of the paper are still very relevant, and it is an instructive read. For example, since ImageNet classes are based on WordNet, the paper looks at the effect of the semantic relations (tree) of categories on the performance of the trained models.
+The paper considers three variants of the ImageNet dataset - ImageNet 10K (10184 classes), ImageNet 7K (7404 classes) and ImageNet 1K (1000 classes).
+They also consider smaller variants with randomly sampled classes or cases where the examples are sampled from one high-level category like vehicles.
+SVM and KNN models are used with features like Bag of Words, GIST descriptors, and spatial pyramid of histograms.
+Observations
+ +A model that performs well on the smaller dataset (with fewer classes) may not perform well on the larger dataset (with more classes).
+There seems to be an approximate correlation between the structure of the semantic hierarchy of the labels (obtained via WordNet) and visual confusion between the categories.
+For example, consider two high-level concepts, say, artifacts and animals. The model is less likely to confuse classes across the high-level concepts but more likely to confuse classes within the respective concepts.
+For dense categories (categories where the classes are semantically more closely related to each other), the model tends to make more mistakes (even if the number of classes is fewer).
+Accounting for the label hierarchy (in the loss function) improves the classification performance.
+The paper proposes a Competitive training mechanism to train a mixture of independent generative models.
+The idea is that this mixture of different models would divide the data distribution amongst themselves and specialize to their respective splits.
+The training procedure is related to clustering-based methods.
+In causal modeling, a common assumption is that the data is generated by a set of independent mechanisms.
+It is not known which mechanism generates which datapoint, and recovering the underlying mechanisms can be modeled as learning a structural causal generative model.
+The paper assumes that the supports of the different generators do not overlap, i.e., the underlying data distribution is factorized into non-overlapping regions.
+This data factorization is learned using a set of discriminators.
+If there are $k$ generators, $k$ binary partition functions $c_1, \dots, c_k$ are used.
+For a given datapoint $x$, if $c_i(x) = 1$ then $c_j(x) = 0$ for all other $j$, and $x$ is assigned to the $i^{th}$ generator.
+For a fixed partition function $c_j^t$ ($t$ denotes the partition function at time $t$), minimize the sum of f-divergence between the model and the data distribution (that is assigned to it). The loss formulation is an upper bound on the f-divergence of the mixture model.
+In the next step, the data points are re-assigned to the generative models, based on the likelihood of each data point for each model.
+The likelihood is estimated by training a discriminator that can distinguish the generated samples from the real samples.
+The independence assumption may be too restrictive because the low-level features will be common across the distribution splits.
+This “violation” can be avoided by pretraining the model using a uniform random split of the dataset. In that case, the independence assumption will hold approximately after pretraining.
+Another approach could be to share some parameters across the models.
+A “load balancing” approach is also used: if a model is not assigned enough data points, it keeps training on the data points that were previously assigned to it.
+VAEs tend to be “overly inclusive” of the training distribution, i.e., they try to cover the entire support of the distribution.
+GANs are prone to mode collapse where the model focuses only on one part of the distribution.
+The proposed method provides a middle ground where the different generative models can focus on different parts of the distribution.
+The experiments seem to be limited. The paper shows that their proposed setup improves over the VAE and GAN baselines.
+For datasets, the paper uses two-dimensional synthetic data, MNIST and CelebA
+The paper proposes a contrastive learning approach, called CURL, for performing off-policy control from raw pixel observations (by transforming them into high dimensional features).
+The idea is motivated by the application of contrastive losses in computer vision. But there are additional challenges:
+ +The learning agent has to perform both unsupervised and reinforcement learning.
+The “dataset” for unsupervised learning is not fixed and keeps changing with the policy of the agent.
+Unlike prior work, CURL introduces fewer changes in the underlying RL pipeline and provides more significant sample efficiency gains. For example, CURL (trained on pixels) nearly matches the performance of SAC policy (trained on state-based features).
+CURL uses instance discrimination. Deep RL algorithms commonly use a stack of temporally consecutive frames as input to the policy. In such cases, instance discrimination is applied to all the images in the stack.
+For generating the positive and negative samples, random crop data augmentation is used.
+Bilinear inner product is used as the similarity metric as it outperforms the commonly used normalized dot product.
+For encoding the anchors and the samples, the InfoNCE objective is used. It learns two encoders $f_q$ and $f_k$ that transform the query (base input) and the key (positive/negative samples) into latent representations. The similarity loss is applied to these latents.
+Momentum contrast is used to update the parameters ($\theta_k$) of the $f_k$ network. ie $\theta_k = m \theta_k + (1-m) \theta_q$. $\theta_q$ are the parameters of the $f_q$ network and are updated in the usual way, using both the contrastive loss and the RL loss.
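+A sketch of these two ingredients (the momentum coefficient is illustrative):
+```python
+import torch
+
+def bilinear_logits(q, k, W):
+    # Similarity q^T W k between query and key latents; with a batch of
+    # queries and keys, the diagonal entries are the positive pairs.
+    return q @ W @ k.t()
+
+def momentum_update(f_q, f_k, m=0.95):
+    # theta_k <- m * theta_k + (1 - m) * theta_q; only f_q receives
+    # gradient updates (from the contrastive and RL losses).
+    with torch.no_grad():
+        for p_q, p_k in zip(f_q.parameters(), f_k.parameters()):
+            p_k.mul_(m).add_((1 - m) * p_q)
+```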
+DMControl100K and Atari100K refer to the setups where the agent is trained for 100K steps on DMControl and Atari, respectively.
+Metrics:
+ +Sample Efficiency - How many steps does the baseline need to match CURL’s performance after 100K steps?
+Performance - Ratio of episodic returns by CURL vs. the baseline after 100K steps.
+Baselines:
+ +DMControl
+ +Atari
+ +Results
+ +DM Control
+ +CURL outperforms all pixel-based RL algorithms by a significant margin for all environments on DMControl and most environments on Atari.
+On DMControl, it closely matches the performance of the SAC agent trained on state-space observations.
+On Atari, it achieves a better median human normalized score (HNS) than the other baselines and close to human efficiency in three environments.
+The paper builds on the prior work on self-supervised contrastive learning and extends it for the supervised learning case where many positive examples are available for each anchor.
+Data augmentation module - This module transforms the input example; the paper considers several augmentation strategies.
+Encoder network - This module maps the input to a latent representation.
+The same network is used to encode both the anchor and the sample.
+The representation vector is normalized to lie on the unit hypersphere.
+Projection network - This module maps the normalized representation to another representation, on which the contrastive loss is computed.
+This network is only used for training the supervised contrastive loss.
+Loss - The paper extends the standard contrastive loss formulation to handle multiple positive examples.
+The main effect is that the modified loss accounts for all the same-class pairs (from within the sampled batch as well as the augmented batch).
+The paper shows that the gradient (corresponding to the modified loss) causes the learning to focus more on hard examples. “Hard” cases are the ones where contrasting the anchor benefits the encoder more.
+The proposed loss can also be seen as a generalization of the triplet loss.
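+A minimal sketch of such a multi-positive loss (assuming `z` holds the projections of the combined sampled-plus-augmented batch; the per-anchor normalization here is one plausible variant):
+```python
+import torch
+import torch.nn.functional as F
+
+def supcon_loss(z, labels, temperature=0.07):
+    # z: (n, d) projections; labels: (n,). All other same-class samples
+    # in the batch act as positives for each anchor.
+    z = F.normalize(z, dim=-1)
+    logits = z @ z.t() / temperature
+    self_mask = torch.eye(len(z), dtype=torch.bool)
+    logits = logits.masked_fill(self_mask, -1e9)       # exclude the anchor itself
+    log_prob = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
+    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
+    # Mean log-probability over each anchor's positives.
+    loss = -(log_prob * pos_mask.float()).sum(-1) / pos_mask.sum(-1).clamp(min=1)
+    return loss.mean()
+```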
+Dataset - ImageNet
+Models - ResNet50, ResNet200
+The network is “pretrained” using supervised contrastive loss.
+After pre-training, the projection network is removed, and a linear classifier is added.
+This classifier is trained with the CE loss while the rest of the network is kept fixed.
+Using supervised contrastive loss improves over all the baseline models and data augmentation approaches.
+The resulting classifier is more robust to image corruptions, as shown by the mean Corruption Error (mCE) metric on the ImageNet-C dataset.
+The model is more stable to the choice of hyperparameter values (like optimizers, data augmentation, and learning rates).
+Supervised Contrastive loss is trained for 700 epochs during pre-training.
+Each step is about 50% more expensive than performing CE.
+The dense classifier layer can be trained in as few as ten epochs.
+The temperature value is set to 0.07. Using a lower temperature is better than using a higher temperature.
+The paper considers learning scenarios where the training data is available incrementally (and not at once).
+For example, in some applications, new data is available periodically (e.g., latest news articles come out every day).
+The paper highlights that, in such scenarios, the conventional wisdom of “warm start” does not apply.
+When new data is available, it is better to train a new model from scratch than to update the model trained on previously available data.
+While the two setups lead to similar training performance, the randomly initialized model has a much better generalization performance.
+Create two random, equally-sized partitions of the training data.
+Train the model till convergence on the first half of the data. Then train the model on the entire dataset.
+Models: ResNet18, MLPs, Logistic Regression (LR)
+Dataset: CIFAR10, CIFAR100, SVHN
+Optimizers: Adam, SGD
+Warm starting hurts generalization in all the cases.
+The effect is more pronounced for ResNets and MLPs (compared to LR) and on the harder CIFAR10 dataset (compared to the SVHN dataset).
+The model is given access to k new learning examples at each iteration.
+A warm started model reuses the previously initialized model and trains (till convergence) on the new batch of k items.
+A “randomly initialized” model is trained on all the examples (seen so far) from scratch.
+Dataset: CIFAR10
+Model: ResNet18
+As more training data becomes available, the generalization gap between the two setups increases, and warm starting hurts generalization.
+In this setup, the learner selects k new examples to add to the training dataset (using margin-based sampling).
+Like the previous setup, the warm-start strategy still hurts generalization.
+Train a Resnet18 model on the CIFAR10 dataset and use this model to warm start training on the SVHN dataset.
+When a small percentage of the SVHN dataset is used, the setup resembles pretraining / transfer learning and performs better than training from scratch.
+As the percentage of the SVHN dataset increases, the warm-start approach starts underperforming.
+ResNet18 model on CIFAR10 dataset
+When performing a hyper-parameter sweep over the learning rate and batch size, it is possible to train warm start models to reach the same generalization performance as training from scratch.
+Though, in that case, there are no computational savings as the warm-started models take about the same time (to converge) as the randomly initialized model.
+The increased training time indicates that the warm started model probably needs to forget the knowledge from previous training rounds.
+Warm-started ResNet models that generalize well have a low correlation to their initialization (measured via the Pearson correlation coefficient between the model weights).
+Generalization is damaged even when using a model trained on incomplete data for only a few epochs.
+For warm start models, the gradient (corresponding to the “new” data) is higher than that for randomly initialized models. This hints that regularization may help close the generalization gap. But in practice, regularization helps both the warm-started and the randomly initialized models.
+Warm starting only a few layers also does not close the gap.
+Adding some noise to the warm started model (with the motivation of having a partially random initialization) does help somewhat but also increases the training time.
+Motivating the problem as an instance of catastrophic forgetting, the authors use the EWC algorithm but report that using EWC hurts model performance.
+The paper does not propose a solution to the problem but provides a thorough analysis of the problem setup, which is quite useful for understanding the phenomenon itself.
+The paper proposes a technique for improving the generalization ability of RL agents when evaluated on an unseen environment (which is similar to the training environment).
+The key idea is to learn features that are invariant across environments by using a randomized CNN (f) that randomly perturbs the inputs.
+The policy is trained using the randomized observations obtained using f.
+Invariant features are learned using a feature matching (FM) loss that matches the feature representation of the original and randomized observations.
+The random network’s parameters are initialized as $\alpha I + (1 - \alpha) N\left(0, \sqrt{\frac{2}{n_{in} + n_{out}}}\right)$ where $\alpha \in [0, 1]$, $N$ denotes the Gaussian Distribution, and $n_{in}, n_{out}$ denote the number of input and output channels respectively.
+Xavier Normal distribution is used for randomization to maintain the variance between the input and the randomized input.
+f is randomized per iteration.
+During inference, the expected action is computed by approximating over M samples (i.e., randomizing the input M times).
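+A sketch of such a randomized layer, implementing the initialization formula above literally ($\alpha$ and the kernel size are illustrative):
+```python
+import torch
+import torch.nn as nn
+
+def random_conv(channels=3, alpha=0.5):
+    # Weights drawn as alpha * I + (1 - alpha) * XavierNormal;
+    # f is re-drawn at every training iteration.
+    conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
+    identity = torch.zeros_like(conv.weight)
+    nn.init.dirac_(identity)                 # identity-preserving kernel
+    noise = torch.empty_like(conv.weight)
+    nn.init.xavier_normal_(noise)
+    with torch.no_grad():
+        conv.weight.copy_(alpha * identity + (1 - alpha) * noise)
+    return conv
+```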
+2D CoinRun, 3D DeepMind Lab, 3D Robotics Control Task
+The evaluation environments consist of different styles of backgrounds, objects, and floors.
+Regularization methods: Dropout, L2 regularization, Batch Normalization
+Dataset Augmentation methods: Cutout, Gray out, Inversion, Color Jitter
+On CoinRun, the proposed approach significantly outperforms the other baselines during evaluation. The performance improvement saturates around 10M samples.
+Cycle consistency is used to measure the similarity between two trajectories. The proposed method improves the cycle consistency as compared to the vanilla PPO baseline. It also produces sharper activation maps in the evaluation environments.
+For the large-scale experiments, when evaluated on 500 levels of CoinRun, the proposed method improves the success rates from 39.8% to 58.7%.
+On DeepMind Lab and Surreal robotics control tasks, the proposed method leads to agents that generalize better on the unseen environments (during evaluation).
+The paper compares replay-based approaches with model-based approaches in Reinforcement Learning (RL).
+It hypothesizes that if the parametric model is only used to generate transitions for the update rule, then under certain conditions, replay-based approaches will be as good as model-based approaches.
+Planning: Any algorithm that uses additional computations (but not additional experience) to improve its performance.
+Learning: Any algorithm that uses additional experience to improve its performance.
+In some cases, a replay buffer can be seen as a model. For example, querying using state-action pair (from the replay buffer) is similar to querying the (expected) next-state and reward from a model. In general, the model will be more flexible as any arbitrary state-action pair can be used for querying.
+Parametric models require more computation than sampling from a replay buffer, while the memory cost of maintaining a replay buffer scales linearly with its capacity.
+Parametric models are useful for planning multiple-steps into the future while it is much harder to do so with a replay buffer (even more so with pixel observations).
+An imperfect model may be more suitable for selecting actions (instead of updating the policy) because the chosen action, when executed in the environment, will lead to transitions that would improve the model.
+When planning with an imperfect model, it is better to plan backward, as the update is applied on an imaginary state (which would not be encountered if the model is poor).
+If the model is accurate, forward and backward planning is equivalent. This distinction between forward and backward updates does not apply to replay buffers.
+When using a replay buffer and (i) uniformly replaying transitions, (ii) from a buffer containing only full episodes, and (iii) using TD updates, then the algorithm is stable.
+When using a replay buffer and (i) uniformly replaying transitions, (ii) generating transitions using a model, and (iii) using TD updates, then the algorithm can diverge.
+This case can be fixed by:
+ +Repeatedly iterating over the model and sampling transitions from the states the model generates (not a satisfactory solution).
+Using multiple-step returns (this can increase the variance).
+Use algorithms specifically for stable off-policy learning (not a definitive solution).
+The paper compares SimPLe (model-based) with Rainbow DQN (replay-based).
+The paper shows that when using a similar number of real interactions, Rainbow DQN needs fewer replay samples than model samples in SimPLe, making it more efficient (computation-wise).
+When using a parametric model in a replay-like setting (sampling observed states from the past), model-based learning can be unstable (in theory); under this sampling distribution, using a replay buffer directly is likely a better strategy.
+Parametric models are likely more useful when planning multiple steps into the future, e.g., for selecting actions.
+The paper explores the connections between the concepts of a single agent vs. society of agents.
+A society of agents can be modeled as a single agent while a single agent can be modeled as a society of components (or sub-agents).
+The paper focuses on mechanisms for training a society of self-interested agents to solve a given task, as if the system were a single agent.
+Societal-decision making framework relates the local optimization problem of a single agent with the global optimization problem of a society of agents.
+Cloned Vickrey Society is proposed as a mechanism to guarantee that an agent’s dominant strategy equilibrium coincides with the group’s optimal policy.
+A class of decentralized RL algorithms that optimize the MDP objective of the society as a whole, as a consequence of individual agents optimizing their objectives.
+Empirical evaluation of the Cloned Vickrey Society using an implementation called Credit Conserving Vickrey.
+Environment - a tuple that specifies an input space, an output space, and parameters for determining an objective.
+ +Agent - a function that maps input space to output space.
+Objective - a functional that maps an agent to a real number.
+In auction environments, the input space is a single auction item (say s), and the output space is bidding space B.
+There are N agents who compete by bidding for an item s using their bidding policy.
+$b$ is a vector of bids produced by the agents.
+$v_s$ is a vector of agent’s valuations of item s.
+The $i^{th}$ agent’s utility is given as $v_s^i \times X^i(b) - P^i(b)$. Here, $X^i(b)$ is the portion of $s$ allocated to $i^{th}$ agent and $P^i(b)$ is the price that $i^{th}$ agent is willing to pay.
+Each agent is independently maximizing its utility.
+In certain conditions (i.e., if the auction is dominant strategy incentive compatible), it is optimal for each agent to bid its valuation.
+These conditions are satisfied by the Vickrey auction where $P^i(b)$ is set to be the second-highest bid and $X^i(b) = 1$ if the $i^{th}$ agent wins (and 0 otherwise).
+A society is a set of agents where each agent is a tuple of bidding policy $\psi$ and a transformation function.
+The environment is modeled at two levels - (i) global environment (referred to as the global MDP) and local environment (referred to as local auction).
+Each state $s$ in the global MDP is an auction item in a different auction. The winner (of local auction at $s$) transforms $s$ into some other state $s’$.
+If these transformations are modeled as actions, then the proposed framework can be interpreted as a decentralized reinforcement learning framework.
+Motivated by the design of market economy (where economic transactions determine wealth distribution), the paper proposes that, for an agent, the valuation of winning an auction is the revenue it can receive in the auction at the next timestep by selling the transformed state.
+A global MDP that adheres to this design is referred to as the Market MDP.
+There is a catch in the design of the market MDP - the winning agent at time $t$ receives the amount that the highest bidder is willing to pay at time $t+1$, but the winner at time $t+1$ only pays the second-highest bid. Hence, the credit is not conserved.
+This inconsistency can be fixed by introducing “duplicate” (or cloned) agents, and the society is called the Cloned Vickery Society.
+The Cloned Vickrey Auction mechanism is compared against alternate bidding mechanisms like first price auction (where winner pays the bid they proposed), solitary version of Vickrey auction (no cloning), and Environment Reward where only environment reward is used, and there is no price term.
+It is empirically shown that the Cloned Vickrey Auction learns bids that are closest to the agents’ actual valuations. Moreover, the solitary version leads to bids that are more spread out than the ones learned by the cloned version. This highlights the importance of competitive pressure for learning bid values.
+Three different implementations of Cloned Vickrey Auction are considered:
+ +Bucket Brigade (BB) - winner at timestep $t$ receives the highest bid at time step $t+1$, and the subsequent winner pays the highest bid. This case satisfies Credit Conservation and Bellman Optimality.
+Vickrey (V) - winner at timestep $t$ receives the highest bid at time step $t+1$, and the subsequent winner pays the second-highest bid. This case satisfies Truthful Dominant Strategy and Bellman Optimality.
+Credit Conserving Vickrey (CCV) - winner at timestep $t$ receives the second-highest bid at time step $t+1$, and the subsequent winner pays the second-highest bid. This case satisfies Truthful Dominant Strategy and Credit Conservation.
+CCV implementation provides bid values closest to the optimal Q-values.
+In one experiment, the paper explores the use of the proposed approach for selecting between sub-policies. It shows that CCV is more sample efficient for pretraining sub-policies and adapting them to transfer tasks.
+In another experiment, the task is to transform MNIST images by composing two out of six affine transformations. The transformed images are fed to a pretrained classifier that predicts a label. The agent gets a reward of 1 if the classifier makes the correct prediction and 0 otherwise. The CCV implementation obtains a mean reward of 0.933, highlighting the effectiveness of the CCV model.
+The paper proposes Stochastic Weight Averaging (SWA) procedure for improving the generalization performance of models trained with SGD (with cyclic or constant learning rate).
+Specifically, the model is checkpointed at several points along the training trajectory, and these checkpoints are averaged (in the parameter space) to obtain a single model.
+“Stochastic” in the name refers to the idea that with cyclical or constant learning rate, SGD proposals are approximately sampled from a neural network’s loss surface and are hence stochastic.
+SWA uses a learning rate schedule that allows exploration in the weight space.
+SGD with cyclical and constant learning rates explore points (model instances) at the periphery of high-performing networks.
+With different initializations, SGD will find different points (of low training loss) on this boundary, but will not move inside it.
+Averaging the points provide a mechanism to move inside this periphery.
+The train and the test error surfaces, while being similar, are not perfectly aligned. Hence, averaging several models (along the optimization trajectory) could lead to a more robust model.
+Given a model $w$ and some training budget $B$, train the model in the conventional way for approx 75% of the budget.
+Starting from that point, continue training with the remaining budget, with a constant or cyclical learning rate.
+For fixed learning rate, checkpoint models at each epoch. For cyclical learning rate, checkpoint the model at the lowest learning rate in the cycle.
+Average all the models to get the SWA model.
+If the model has Batch Normalization layers, run an additional pass to compute the SWA model’s running mean and standard deviation.
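+A sketch of the averaging step (checkpoint collection and the BatchNorm pass are omitted):
+```python
+import copy
+import torch
+
+def average_checkpoints(models):
+    # Elementwise average, in parameter space, of the collected checkpoints.
+    swa = copy.deepcopy(models[0])
+    with torch.no_grad():
+        for params in zip(swa.parameters(), *[m.parameters() for m in models]):
+            params[0].copy_(torch.stack(params[1:]).mean(dim=0))
+    return swa  # with BatchNorm, run one extra pass over the data afterwards
+```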
+The computational and space complexity of computing the SWA model is relatively low.
+The paper highlights the ensembling-like effect of SWA by showing that if the model checkpoints ($w_i$) are generated by training with Fast Geometric Ensembling (FGE), the difference between averaging the weights and averaging the predictions is of the order $O(\Delta)$ where $\Delta = max ||w_i - w_{SWA}||$.
+Note that SWA does not have the overhead of an extra forward pass during inference.
+Datasets: CIFAR10, CIFAR100, ImageNet
+Models: VGG16, WideResNet, 164-layer preactivation ResNet, ShakeShake, Pyramid Net.
+Baselines: Conventional SGD, Exponentially decaying average with SGD and FGE.
+In all the CIFAR experiments, SWA consistently outperforms SGD within one training budget, and its performance keeps improving with further training.
+SWA also achieves performance comparable to FGE, despite FGE being an ensemble method.
+On ImageNet, SWA is run on a pre-trained model, and it improves performance in all the cases.
+An ablation experiment (on CIFAR-100) shows that it is possible to train a network (with SWA) using a fixed learning rate. In that setup, using SWA improves performance by 16%.
+Meta-learning techniques are shown to benefit from the use of deep neural networks.
+BatchNorm is a commonly used component when training deep networks, especially for vision tasks.
+However, BatchNorm and meta-learning make contradictory assumptions, and their combination may not work well in practice.
+The paper proposes TaskNorm, a normalization method that is designed explicitly for meta-learning.
+Standard meta-learning setup with $k$ tasks, each task with its own context and target set.
+Two sets of parameters are considered during meta-learning - (i) global parameters, and (ii) task-specific parameters.
+Meta-learning setup can be viewed as an inference task, where the task-specific parameters are inferred using a context set and some additional (trainable) parameters.
+Normalization layers are commonly used to accelerate the training of neural networks. The general approach is to use normalization moments (statistics) along with some learned parameters.
+BatchNorm is a well-known and widely used normalization approach. It relies on the implicit assumption that the dataset comprises iid samples from some underlying distribution.
+However, in meta-learning, data points are assumed to be iid only within a specific task.
+This leaves open the question of what moments to use during meta-train and meta-test time.
+Conventional BatchNorm (CBN): compute moments at meta-train time and use them during meta-test time.
+This is equivalent to lumping the moments with the global parameters. I.e., the running moments are shared globally, while the data is iid only locally.
+Using CBN with MAML leads to poor results.
+Moreover, the meta-learning setup can sometimes require the use of a very small batch size (e.g., 1-shot learning). In those cases, the computed statistics are likely to be inaccurate.
+Transductive BatchNorm (TBN): use context/target set statistics at both meta-train and meta-test time.
+This is the default BatchNorm mode used in MAML.
+Instance-based normalization: moments are computed separately for each instance.
+This mode corresponds to treating the statistics as local at the observation level.
+These methods provide only limited improvement in performance, and can sometimes have a large overhead.
+The normalization statistics are local at the task level, and the statistics for a given data point should only depend on the context set’s data points. They should not depend on the other elements of the target set.
+Meta-Batch Normalisation (METABN) is a precursor to TaskNorm where the context set alone is used to compute the normalization statistics for both the context and the target set (during both meta-test and meta-train time).
+METABN does not perform well when used with small context sets.
+TaskNorm overcomes this limitation by using a set of non-transductive, secondary moments (computed from the input being normalized).
+When the context is small, using additional moments will help to improve the moment estimates.
+In the general case, a trainable blending factor, $\alpha$, is used to combine the two sets of moments.
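+A simplified sketch of the blending step (the paper's pooled variance additionally includes correction terms between the two means, omitted here):
+```python
+import torch
+
+def tasknorm_moments(context_acts, input_acts, alpha):
+    # Blend context-set moments with per-input (instance-norm style) moments.
+    mu_c, var_c = context_acts.mean(), context_acts.var()
+    mu_i, var_i = input_acts.mean(), input_acts.var()
+    mu = alpha * mu_c + (1 - alpha) * mu_i
+    var = alpha * var_c + (1 - alpha) * var_i   # correction terms omitted
+    return mu, var
+```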
+While the computational cost of TaskNorm is slightly more than CBN, it converges faster than CBN in practice.
+Normalization mechanism in Reptile can be interpreted as a particular case of TaskNorm.
+Small scale few-shot classification experiments
+ +Omniglot and mini-ImageNet datasets
+First order MAML, with different kinds of normalization schemes.
+Transductive BatchNorm performs the best.
+Among non-transductive approaches, TaskNorm using Instance Normalisation augmentation performs the best.
+A similar trend holds for the speed of convergence as well.
+Large scale few-shot classification experiments
+ +MetaDataset dataset
+CNAPs model
+The context set’s size varies across tasks in this setup and can be as small as 5.
+TaskNorm with Instance Normalisation ranks first in 10 (out of 13) datasets and is also the fastest to train.
+While Instance-based methods (Instance Normalisation and Layer Normalisation) are the slowest to converge, they still outperform the running average based methods (conventional BatchNorm).
+The results demonstrate that designing meta-learning specific normalization methods can significantly improve performance and that Transductive BatchNorm may not always be the optimal choice.
+The paper proposes GradNorm, a gradient normalization algorithm that improves multi-task training by dynamically tuning the magnitude of gradients corresponding to different tasks.
+During multi-task training, some tasks can dominate the training, at the expense of others.
+It is common to define the multi-task loss as a linearly weighted combination of the individual task losses.
+The paper proposes two changes to this setup:
+ +Adapt weight-coefficients, assigned to each loss term, at each training step.
+Directly modify the gradient magnitudes, corresponding to different tasks, so that all the tasks are learning at similar rates.
+Proposed GradNorm algorithm is similar to BatchNorm, but it performs normalization across tasks, not data batches.
+Gradient norm at timestep $t$, for the $i^{th}$ task, is computed as the product between average gradient norm (across all tasks at timestep $t$) and $r_i(t) ^ {\alpha}$.
+$r_i$ is the relative inverse training rate of task $i$. It is defined as the ratio between the loss ratio of task $i$ and the average loss ratio (across all the tasks).
+$\alpha$ is a hyperparameter.
+This computed per-task gradient norm is treated as the target value for actual gradient norms.
+An additional $L_1$ loss between the actual and the target gradient norms, summed over all the tasks, is incorporated; this loss optimizes the weight-coefficients only.
+After every step, the weight-coefficients are renormalized to decouple the gradient normalization from the global learning rate.
+Note that all the gradient norm computations are performed only for the layers on which GradNorm is applied. Generally, GradNorm is used with only the last shared layer of weights (to save on computational costs).
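+A sketch of this loss computation (inputs are assumed to be 1-D tensors over tasks, with the gradient norms taken w.r.t. the last shared layer):
+```python
+import torch
+
+def gradnorm_loss(grad_norms, losses, initial_losses, alpha=1.5):
+    # grad_norms: (T,) norms of each task's (weighted) gradient w.r.t. the
+    # last shared layer; losses, initial_losses: (T,) current / initial losses.
+    loss_ratios = losses / initial_losses
+    r = loss_ratios / loss_ratios.mean()                # relative inverse training rates
+    target = (grad_norms.mean() * r ** alpha).detach()  # treated as a constant target
+    return (grad_norms - target).abs().sum()            # L1 loss; updates the w_i only
+```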
+Two variants of NYUv2 dataset – NYUv2+seg (small dataset) and NYUv2+kpts (big dataset).
+Both regression and classification setups were used.
+Models:
+ +SegNet with a symmetric VGG16 encoder/decoder
+FCN with modified ResNet-50 as the encoder and shallow ResNet as the decoder.
+Standard pixel-wise losses for each task.
+GradNorm with $\alpha=1.5$ outperforms the equal-weight baseline and either surpasses or matches the best performance of single networks for each task.
+Almost any value of $\alpha$ in $(0, 3)$ improves the network’s performance over an equal weight baseline.
+The paper hypothesizes that the main optimization challenges in multi-task learning arise because of negative interference between different tasks’ gradients.
+It hypothesizes that negative interference happens when:
+ +The gradients are conflicting (i.e., have a negative cosine similarity).
+The gradients coincide with high positive curvature.
+The difference in gradient magnitude is quite large.
+The paper proposes to work around this problem by performing “gradient surgery.”
+If two gradients are conflicting, modify the gradients by projecting each onto the other’s normal plane.
+This modification is equivalent to removing the conflicting component of the gradient.
+This approach is referred to as projecting conflicting gradients (PCGrad).
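+A sketch of the projection for the two-task case (assuming flattened gradients):
+```python
+import torch
+
+def pcgrad(g1, g2):
+    # If the gradients conflict (negative cosine similarity), project each
+    # onto the normal plane of the *original* other gradient.
+    if torch.dot(g1, g2) < 0:
+        g1_proj = g1 - torch.dot(g1, g2) / g2.norm() ** 2 * g2
+        g2_proj = g2 - torch.dot(g2, g1) / g1.norm() ** 2 * g1
+        return g1_proj, g2_proj
+    return g1, g2
+```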
+Theoretical Analysis
+ +The paper proves the local conditions under which PCGrad improves multi-task gradient descent in the two-task setup.
+The conditions are:
+ +Angle between the task gradients is not too small.
+Difference in the magnitude of the gradients is sufficiently large.
+Curvature of the multi-task gradient is large.
+Large enough learning rate.
+Experimental Setup
+ +Multi-task supervised learning
+ +MultiMNIST, Multi-task CIFAR100, NYUv2.
+For Multi-task CIFAR-100, PCGrad is used with the shared parameters of the routing networks.
+For NYUv2, PCGrad is combined with MTAN.
+In all the cases, using PCGrad improves the performance.
+Multi-task Reinforcement Learning
+ +Meta-World Benchmark
+PCGrad + SAC outperforms all other baselines.
+In the context of SAC, the paper suggests learning temperature $\alpha$ on a per-task basis.
+Goal-conditioned Reinforcement Learning
+ +Goal-conditioned robotic pushing task with a Sawyer robot.
+PCGrad + SAC outperforms vanilla SAC.
+Conditional computation is a technique to increase a model’s capacity (without a proportional increase in computation) by activating parts of the network on a per example basis.
+The paper describes (and addresses) the computational and algorithmic challenges in conditional computation. It introduces a sparsely-gated Mixture-of-Experts layer (MoE) with thousands of feed-forward sub-networks.
+GPUs are fast at matrix arithmetic but slow at branching.
+Large batch sizes amortize the cost of parameter updates. Conditional computation reduces the effective batch size for different components of the model.
+Network bandwidth can be a bottleneck with the network demand overshadowing the computational demand.
+Additional losses may be needed to achieve the desired level of sparsity.
+Conditional computation is most useful for large datasets.
+$n$ Expert Networks - $E_1, \dots, E_n$.
+Gating Network $G$ to select a sparse combination of experts.
+Output of the MoE module is the weighted sum of predictions of experts (weighted by the output of the gate).
+If the gating network’s output is sparse, then the outputs of some of the experts do not have to be computed.
+In theory, one could use a hierarchical mixture of experts where a mixture of experts is trained at each level.
+Softmax Gating
+Noisy top-k gating - Add tunable Gaussian noise to the output of softmax gating and retain only the top-k values. A second trainable weight matrix controls the amount of noise per component.
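+A sketch of this gating scheme ($W_g$ and $W_{noise}$ are the two trainable matrices; $k$ is illustrative):
+```python
+import torch
+import torch.nn.functional as F
+
+def noisy_topk_gate(x, W_g, W_noise, k=4):
+    # H(x) = x W_g + eps * softplus(x W_noise), eps ~ N(0, 1);
+    # keep the top-k entries, set the rest to -inf, then softmax.
+    h = x @ W_g + torch.randn(x.size(0), W_g.size(1)) * F.softplus(x @ W_noise)
+    topk_vals, topk_idx = h.topk(k, dim=-1)
+    sparse = torch.full_like(h, float('-inf')).scatter(-1, topk_idx, topk_vals)
+    return sparse.softmax(dim=-1)   # zero weight for all non-selected experts
+```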
+Shrinking Batch Problem
+ +If the MoE selects k out of n experts, the effective batch size reduces by a factor of k / n.
+This reduction in batch size is accounted for by combining data parallelism (for standard layers and gating networks) and model parallelism (for the experts in MoE). Thus, with $d$ devices, the batch size changes by a factor of $(k \times d) / n$.
+For hierarchical MoE, the primary gating network uses data parallelism while secondary MoEs use model parallelism.
+The paper considers LSTM models where the MoE is applied once the previous layer has finished. This increases the batch size (for the current MoE layer) by a factor equal to the number of unrolling timesteps.
+Network bandwidth limitations can be overcome by ensuring that the ratio of computation (of each expert) to the input and output size is greater than (or equal to) the ratio of computational to network capacity.
+Computational efficiency can be improved by using larger hidden layers (or more hidden layers).
+Balancing Expert Utilization
+ +Importance of an expert (relative to a batch of training examples) is defined as the batchwise sum of the expert’s gate values.
+An additional loss, called importance loss, is added to encourage the experts to have equal importance.
+The importance loss is defined as the square of the coefficient of variation (of a set of importance values) multiplied by a (hand-tuned) scaling factor $w_{importance}$.
+In practice, an additional loss called $L_{load}$ might be needed to ensure that the different experts get equal load (along with equal importance).
+Datasets
+ +Billion Word Language Modeling Benchmark
+100 Billion word Google News Corpus
+Machine Translation datasets
+ +Single Language Pairs - WMT’14 En to Fr (36M sentence pairs) and En to De (5M sentence pairs).
+Multilingual Machine Translation - a large combined dataset of twelve language pairs.
+In all the setups, the proposed MoE models achieve significantly better results than the baseline models, at a lower computational cost.
+Common transfer learning methods focus on transferring knowledge in the model feature space.
+In contrast, the paper argues that the learned knowledge is more concisely captured in the “classifier space” as the classifier is fitted for all the samples for a given class, while the feature representation is specific to each sample.
+Building on this intuition, the paper proposes to combine strong classifiers (trained on large datasets) with weak classifiers (trained on smaller datasets) to improve the weak classifiers’ performance.
+Given $n$ classifiers $C_1, …, C_n$ trained with a large amount of data, and a weak classifier $a$ trained for a class with few samples:
+Find the nearest neighbors of $a$.
+Train a new classifier by linearly combining $a$ with its nearest classifiers.
+The coefficients (for linearly combining the classifiers) are learned using another classifier called AlphaNet.
+In theory, this approach can be used with any set of classifiers.
+A long-tailed dataset is one where some classes (referred to as the tail classes) have very few examples—for example, ImageNet-LT and Places-LT.
+Split the long-tailed dataset into two splits - “base” classes with $B$ (number of) classes and “few” classes with $F$ (number of) classes.
+Total number of classes $N = B + F$.
+Start with a pre-trained model, with classifiers $w_j$ and biases $b_j$ for $j \in (1, N)$.
+For a given target class $j$, find its top $k$ nearest neighbor classifiers and concatenate their output.
+For each “few” class, learn a feedforward network that takes the concatenated representation (of classifiers) as the input and returns a vector of $k \alpha$ values.
+These $\alpha$ values are interpreted as the classifier’s strength (or confidence) in its nearest neighbors.
+The (normalized) alpha values are used for defining the weight and bias for the classifier for the given “few” class.
+The collection of all the “few” classifiers is referred to as the AlphaNet.
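+A rough sketch of the composition step, under the simplifying assumption that the new classifier is the weak classifier adjusted by an alpha-weighted sum of its nearest strong classifiers (the module name and layer sizes are illustrative, not from the paper):
+```python
+import torch
+import torch.nn as nn
+
+class AlphaNetSketch(nn.Module):
+    """Sketch: combine one weak classifier with its k nearest strong
+    classifiers using learned, normalized alpha coefficients."""
+    def __init__(self, d, k):
+        super().__init__()
+        # Input: concatenated weights of the k nearest strong classifiers.
+        self.alpha_net = nn.Sequential(
+            nn.Linear(k * d, 64), nn.ReLU(), nn.Linear(64, k))
+
+    def forward(self, weak_w, strong_ws):
+        # weak_w: [d]; strong_ws: [k, d] (nearest-neighbor classifiers).
+        alphas = self.alpha_net(strong_ws.flatten()).softmax(dim=-1)  # [k]
+        # Move the weak classifier towards its strong neighbors.
+        return weak_w + (alphas.unsqueeze(-1) * strong_ws).sum(dim=0)
+```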
+The paper outlines a degenerate case, where the confidence in the prediction of all the strong classifiers goes to 0. The paper proposes to counter this case by clamping the $\alpha$ values.
+The entire setup is trained end-to-end using cross-entropy loss on AlphaNet.
+Given the proposed approach’s flexibility, it is used to combine the state-of-the-art models on ImageNet-LT, namely retraining classifiers on class-balanced samples and training models with weight normalization. The combined setup outperforms the individual models.
+One interesting observation is that it is useful to include the weak classifiers, along with the strong classifiers, as AlphaNet adjusts the position of weak classifiers towards the appropriate strong classifier.
+While the idea is described in the context of long-tail data distribution, the idea is useful in the general context of non-stationary data distribution. One instantiation could be lifelong class incremental learning where the model encounters new data classes during training. For some time duration (till sufficient data points are seen), the newly seen classes are the “few” classes. This approach can help with faster adaptation when the model is yet to see sufficient examples for the unseen classes.
+The paper investigates the practical impact of the deadly triad (function approximation, bootstrapping, and off-policy learning) in deep Q-networks (trained with experience replay).
+The deadly triad is called so because when all the three components are combined, TD learning can diverge, and value estimates can become unbounded.
+However, in practice, the components of the deadly triad have been combined successfully. An example is training DQN agents to play Atari.
+The effect of each component of the triad can be regulated with some design choices:
+ +Bootstrapping - by controlling the number of steps before bootstrapping.
+Function approximation - by controlling the size of the neural network.
+Off-policy learning - by controlling how data points are sampled from the replay buffer (i.e., using different prioritization approaches)
+The problem is studied in two contexts: toy example and Atari 2600 games.
+The paper makes several hypotheses about how the different components may interact in the triad and evaluates these hypotheses by training DQN with different hyperparameters:
+ +Number of steps before bootstrapping - 1, 3, 10
+Four levels of prioritization (for sampling data from the replay buffer)
+Bootstrap target - Q-learning, target Q-learning, inverse double Q-learning, and double Q-learning
+Network sizes - small, medium, large, and extra-large.
+Each experiment was run with three different seeds.
+The paper formulates a series of hypotheses and designs experiments to support/reject the hypotheses.
+Rewards are clipped between -1 and 1, and the discount factor is set to 0.99. Hence, the maximum absolute action value is bounded by 1 / (1 - 0.99) = 100. This upper bound is used to detect soft-divergence in the value estimates.
+The paper reports that while soft-divergence does occur, the values do not become unbounded, thus supporting the hypothesis.
+One manifestation of bootstrapping on separate networks is target-Q learning. While using separate networks helps on Atari, it does not entirely solve the problem on the toy setup.
+One manifestation of correcting for the overestimation bias is using double Q-learning.
+In the standard form, double Q-learning benefits by bootstrapping on a separate network. To isolate the gains by using each component independently, an inverse double Q-learning update is used that does not use a separate target-network for bootstrapping.
+Experimentally, Q-learning is the most unstable while target Q-learning and double Q-learning are the most stable. This observation supports the hypothesis.
+This hypothesis is intuitive as the dependence on bootstrapping is reduced with multi-step returns.
+Experimental results support this hypothesis.
+This hypothesis is based on the assumption that more flexible value function approximations may behave more like the tabular case.
+In practice, smaller networks show fewer instances of instability than the larger networks.
+The hypothesis is not supported by the experiments.
+Generally, soft-divergence correlates with poor control performance.
+For example, longer multi-step returns lead to fewer instances of instabilities and better performance.
+The trend is more interesting in terms of network capacity. Large networks tend to diverge more but also perform the best.
+While action-value estimates can grow to large values, they can recover to plausible values as training progresses.
+The paper presents an extensive study of the effects of experience replay in Q-learning based methods.
+It focuses explicitly on the replay capacity and replay ratio (ratio of learning updates to experience collected).
+Replay capacity is defined as the total number of transitions stored in the replay buffer.
+Age of a transition (stored in the replay buffer) is defined as the number of gradient steps taken by the agent since the transition was stored.
+The larger the replay capacity, the greater the age of the oldest transition (also referred to as the age of the oldest policy).
+The larger the replay capacity, the greater the degree of “off-policyness” of the transitions in the buffer (with everything else held constant).
+Replay ratio is the number of gradient updates per environment transition. This ratio can be used as a proxy for how often the agent uses old data (vs. collecting new data) and is related to off-policyness.
+In the DQN paper, the replay ratio is set to 0.25.
+For experiments, a subset (of 14 games) is selected from Atari ALE (Arcade Learning Environment) with sticky actions.
+Each experiment is repeated with three seeds.
+Rainbow is used as the base algorithm.
+Total number of gradient updates and batch size (per gradient update) are fixed for all the experiments.
+Rainbow used a replay capacity of 1M and an oldest policy age of 250K.
+In the experiments, the replay capacity varies from 0.1M to 10M (5 values), and the age of the oldest policy varies from 25K to 25M (4 values).
+With the age of the oldest policy fixed, performance improves with higher replay capacity, probably due to increased state-action coverage.
+With fixed replay capacity, reducing the oldest policy’s age improves performance, probably due to the reduced off-policyness of the data in the replay buffer.
+However, in some specific instances (with sparse reward, hard exploration setup), performance can drop when reducing the oldest policy’s age.
+Increasing replay capacity, while keeping the replay ratio fixed, provides varying improvements that depend on the particular values of replay capacity and replay ratio.
+The paper reports the effect of these choices for DQN as well.
+Unlike Rainbow, DQN does not improve with larger replay capacity, irrespective of whether the replay ratio or age of the oldest policy is kept fixed.
+Given that the Rainbow agent is a DQN agent with additional components, the paper explores which of these components leads to an improvement in Rainbow’s performance as replay capacity increases.
+Four new DQN variants are created by adding each of Rainbow’s four components to the base DQN agent.
+DQN with n-step returns is the only variant that benefits from increased replay capacity.
+The usefulness of n-step returns is further validated by verifying that a Rainbow agent without n-step returns does not benefit from increased replay capacity, while a Rainbow agent missing any one of the other components still does.
+Prioritized Experience Replay does not significantly affect the performance with increased replay capacity.
+The observation that n-step returns are critical for taking advantage of larger replay sizes is surprising because the uncorrected n-step returns are theoretically not suitable for off-policy learning.
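+For concreteness, a sketch of the uncorrected n-step target (no importance-sampling correction, which is what makes its empirical benefit under off-policy replay surprising):
+```python
+def n_step_target(rewards, bootstrap_value, gamma=0.99):
+    """Uncorrected n-step return: discounted sum of the n rewards stored
+    in the replay buffer plus a discounted bootstrap from step n.
+    rewards: [r_t, ..., r_{t+n-1}]; bootstrap_value: max_a Q(s_{t+n}, a)."""
+    target = bootstrap_value
+    for r in reversed(rewards):
+        target = r + gamma * target
+    return target
+```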
+The paper tests the limits of increasing replay capacity (with n-step returns) by performing experiments in the offline-RL setup: an agent collects a dataset of about 200M frames, and these frames are used to train another agent.
+Even in this extreme setup, n-step returns improve the learning agent’s performance.
+Hypothesis 1: n-step returns help to counter the increased off-policyness produced by a larger replay buffer.
+ +Hypothesis 2: Increasing the replay buffer’s capacity may reduce the variance of the n-step returns.
+ +This hypothesis is evaluated by training on environments with less variance or by turning off the sticky actions in the Atari domain.
+While the hypothesis does explain the gains from using n-step returns to some extent, n-step gains are observed even in environments with low variance.
+The paper introduces Multi-Object Network (MONet) architecture that learns a modular representation of images by spatially decomposing scenes into objects and learning a representation for these objects.
+Two components:
+ +Attention Module: generates spatial masks corresponding to the objects in the scene.
+VAE: learns a representation for each object.
+VAE components:
+ +Encoder: It takes as input the image and the attention mask generated by the attention module and produces the parameters of a distribution over the latent variable z.
+Decoder: It takes as input the latent variable z and attempts to reproduce the image.
+The decoder loss term is weighted by mask, i.e., the decoder tries to reproduce only those parts of the image that the attention mask focuses on.
+The attention mechanism is auto-regressive with an ongoing state (called a scope) that tracks which parts of the image are not yet attended over.
+In the last step, no attention mask is computed, and the previous scope is used as-is. This ensures that all the masks sum to 1.
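+A minimal sketch of this recursive scope update, assuming the attention network outputs a per-pixel value alpha in [0, 1] (the real model works with log-probabilities, but the structure is the same):
+```python
+import torch
+
+def compute_masks(image, attention_net, num_slots):
+    """The scope tracks the portion of the image not yet attended over;
+    by construction, the returned masks sum to 1 at every pixel."""
+    scope = torch.ones_like(image[:, :1])  # [B, 1, H, W], full image
+    masks = []
+    for _ in range(num_slots - 1):
+        alpha = attention_net(image, scope)  # per-pixel values in [0, 1]
+        masks.append(scope * alpha)
+        scope = scope * (1 - alpha)
+    masks.append(scope)  # last slot takes whatever scope remains
+    return masks
+```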
+The VAE also models the attention mask over the components, i.e., the probability that the pixels belong to a particular component.
+A model could efficiently process compositional visual scenes if it can exploit some recurring structures in the scene.
+The paper validates this hypothesis by showing that an autoencoder performs better if it can build up the scenes compositionally, processing one mask at a time (these masks are ground-truth spatial masks) rather than processing the scene at once.
+VAE encoder parameterizes a diagonal Gaussian latent posterior with a spatial broadcast decoder that encourages the VAE to learn disentangled features.
+MONet with seven slots is trained on Objects Room dataset with 1-3 objects.
+ +It learns to generate different attention masks for different objects.
+Combining the reconstructed components using the corresponding attention masks produces good quality reconstruction for the entire scene.
+Since it is an autoregressive model, MONet can be evaluated for more slots. The model generalizes to novel scene configurations (not seen during training).
+On the Multi-dSprites dataset (modification of the dSprites dataset), the model (post-training) distinguishes individual sprites and background.
+On the CLEVR dataset (2-10 objects per image), the model generates good image segmentations and reconstructions and can distinguish between overlapping shapes.
+A classic paper that looks into strategies for scaling large systems that can degrade gracefully.
+CAP refers to strong Consistency, high Availability, and Partitionability.
+Strong consistency refers to single copy ACID consistency.
+High availability means any consumer can access the data anytime. Generally, this is achieved by adding one or more data replicas.
+Partitionability means that the system can survive a partition between the different replicas.
+Strong CAP theorem states that any system can have at most two of the three properties.
+Weak CAP theorem says that the stronger the guarantees made about any two of the properties, the weaker the guarantees that can be made about the third.
+Assume that the clients are making a request to a server.
+There are two quantities of interest here:
+ +Yield - the fraction of requests that are answered successfully.
+ +Harvest - the fraction of the complete data reflected in the response.
+ +In the presence of faults, a tradeoff is made between yield and harvest. This tradeoff applies to both read and update queries.
+In a hundred-node cluster (without replication), a single-node failure reduces harvest by 1%, and in the case of multi-node failures, the harvest degrades linearly.
+The probability of losing high-priority data can be reduced by replicating it. However, replicating all the data would not guarantee 100% harvest and yield despite the significant costs.
+Decompose a large application into subcomponents so that each component can be provisioned separately. Strong consistency can be applied only to the components that need it, instead of the application as a whole.
+Further, failure of one or more components need not cause the application to fail as a whole.
+Decomposition also provides the opportunity to use orthogonal mechanisms, i.e., mechanisms independent of other mechanisms with no runtime interface.
+Composition of orthogonal subsystems improves the robustness of runtime interactions by locally containing the errors. For example, the orthogonal components can be restarted/replaced independently without affecting other running components.
+The paper presents a formalism for transfer learning, offers a definition of relatedness between tasks, and proposes foliations as a mathematical framework to represent the relationship between tasks.
+The term representation denotes a mechanism for describing and realizing abstract objects, thus allowing manipulation and reasoning about the objects. This description goes beyond the usual meaning (in deep learning), where representation denotes some useful information about data.
+Relatedness describes what changes between tasks. Consider a set of transformations (or functions) that convert one task to another. A relationship between two tasks is an element of this transformation set.
+Given a transformation set, one can define a set of related tasks, which is the set of all the tasks that can be transformed into each other using the functions from the given transformation set. This set of tasks is an equivalence class, and the transformation set is the equivalence relationship.
+Given two related tasks t1 and t2, denote the corresponding models (trained on those tasks) as m1 and m2. One can assume that m1 and m2 are related in the same way as t1 and t2 (equivariance).
+Now, given a set of transformations, one can partition the space of continuous functions into non-overlapping spaces, which describe a set of related tasks. These spaces are referred to as the parallel spaces or transfer spaces.
+A parallel space has a lower dimension than the original space, so knowing which parallel space a model lies on can make it easier to find the model. This is the primary motivation behind transfer learning - knowing the relationship between tasks can make it easier to find a solution to new tasks.
+Another way of partitioning the set of transformations is to use tessellation (e.g., Voronoi diagrams). Tasks in the same partition are similar to each other as compared to a task from another partition.
+Two tasks are defined as similar if the distance between them (under some distance metric) is small.
+Similarity is a geometric notion, while relatedness is a transformative notion. Parallelized space is to relatedness what tessellation is to similarity.
+The distinction between similarity and relatedness is quite nuanced, and the authors provide several examples to differentiate between them.
+Similarity can only be measured in terms of a reference element (similar to what). For example, when one finetunes a pre-trained model on a new task, one assumes that the model’s pretraining task is similar to the current task.
+Given a set (say $T$), a quantity (a function $q$ that maps elements of $T$ to a $k$-dimensional vector) is said to be invariant with respect to a transformation $p$ (defined on $T$) if $q(f) = q(p(f))$, i.e., the value of $q$ at $f$ (belonging to $T$) does not change if $f$ is transformed by $p$.
+If one assumes that the set of transformations is a group, specifically a Lie group whose action on the set of tasks is locally free and regular, then one can define a parallel partitioning of the space of tasks and the space of models.
+One can develop a hierarchical categorization scheme for the set of all considered tasks using the invariant quantities.
+One can consider the space of tasks and models to be smooth manifolds as manifolds naturally give a notion of representation and transformations between them.
+A manifold is a topological space that can be locally mapped to a Euclidean space using coordinate charts. One can define regular foliation by choosing charts that satisfy certain conditions. In that case, the manifold has immersed, connected, non-intersecting submanifolds called leaves.
+The charts (that satisfy those conditions) give a set of rectified coordinates, where the notions of “which leaf a point is on” and “where on the leaf it is” are clearly separated.
+Thus, foliation can provide the theoretical tools to work with parallel spaces.
+How foliations can be incorporated into the theory and solutions for transfer learning is left as future work.
+Key idea: Practicing and remembering diverse solutions to a task can lead to robustness to that task’s variations.
+The paper proposes a framework to implement this idea - train multiple policies such that they are collectively robust to a new distribution over environments while using a single training environment.
+During training, the agent has access to only one MDP.
+During the evaluation, the agent encounters a new MDP which has the same state and action space but may have a different reward and transition function.
+The agent is allowed some interactions (say k) with the test MDP and is then evaluated on the test MDP. The setup is referred to as few-shot robustness.
+Represent a set of policies using a latent variable policy (i.e., a policy conditioned on a latent variable z).
+This has two benefits: (i) multiple policies can be represented by the same object, and (ii) diverse behaviors can be learned by encouraging the trajectories corresponding to different z to be different, while still solving the task.
+A diversity-inducing objective is used to encourage the agent to learn different trajectories for different z.
+Specifically, the mutual information between p(Z) and marginal trajectory distribution for the latent variable policy is maximized, subject to the constraint that each policy achieves close to optimal returns in the train MDP.
+The mutual information between p(Z) and marginal trajectory distribution for the latent variable policy is lower bounded by the sum of mutual information terms over individual states (appearing in the trajectory).
+An unsupervised reward function is defined using the mutual information between states and latent variables.
+\(r(s, a) = \log q_{\phi}(z \| s) - \log p(z)\) where \(q_{\phi}\) is a learned discriminator.
+This unsupervised reward is optimized only when the policy achieves a close-to-optimal environment return; otherwise, the agent optimizes only the environment return.
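+A per-episode sketch of this switching rule; R_opt (the optimal return estimated via the baseline agent), epsilon, and alpha are assumed hyperparameters:
+```python
+def smerl_reward(env_reward, episode_return, log_q_z_given_s, log_p_z,
+                 R_opt, epsilon, alpha):
+    """Add the unsupervised (diversity) reward only when the episode
+    return is within epsilon of the optimal return R_opt."""
+    unsup_reward = log_q_z_given_s - log_p_z  # r(s, a) from above
+    if episode_return >= R_opt - epsilon:
+        return env_reward + alpha * unsup_reward
+    return env_reward
+```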
+SMERL is implemented using SAC with a latent variable maximum entropy policy.
+The set of latent variables is a fixed discrete set \(Z\) and \(p(z)\) is set to be a uniform distribution over this set.
+At the start of an episode, a \(z\) is sampled and used throughout the episode.
+Discriminator \(q_{\phi}(z\|s)\) is trained to infer \(z\) from the visited states.
+A baseline SAC agent is trained beforehand to evaluate if the current training policy achieves close to optimal environment return.
+During the evaluation, the policy corresponding to each latent variable is executed in the test MDP, and the policy with the maximum return is returned.
+Given an MDP \(M\) and \(\epsilon>0\), the MDP robustness set is defined as the set of all MDPs \(M'\) where the optimal policy of \(M'\) produces the same trajectory distribution in \(M'\) as \(M\). Moreover, on the training MDP \(M\), the optimal policies (corresponding to \(M\) and \(M'\)) obtain similar returns.
+The paper shows that SMERL generalizes to MDPs belonging to the robustness set.
+It also provides a simplified view of the optimization objective and shows how it naturally leads to a trajectory-centric mutual information objective.
+Environments
+ +2D navigation environments with point mass.
+Mujoco Environments: HalfCheetah-Goal, Walker2d-Velocity, Hopper-Velocity.
+On the 2D navigation environment, the paper shows that SMERL learns to use different trajectories to reach the goal.
+On the Mujoco setup, the evaluation shows that SMERL generally outperforms the best-performing baseline or is close to the best-performing baseline on different tasks.
+Generally, higher train performance does not correlate with higher test performance, and there is no single policy that performs the best across all the tasks. Thus, it should be beneficial to learn multiple diverse policies that can be selected from during testing.
+The paper describes the efforts to control and repay the technical debt in the build system at Google (called the Build Debt).
+Guiding Principles:
+ +Automate techniques to analyze and fix issues that contribute to technical debt.
+Make it easier to do the right thing as developers can incur technical debt unknowingly.
+Make it hard to do the wrong thing, e.g., by building stricter checks into the build process.
+Note that some of the metrics and design decisions may be outdated now (the paper was written in 2012). However, the core message is still relevant.
+BUILD files encapsulate the specifications for building software.
+Generally, these files are maintained manually, and the dependencies may not be up-to-date over time.
+In extreme cases, some of the build targets are not built for months. Such targets are called zombie targets.
+Originally, any project could depend on any other project’s internal details, thus creating (sometimes unwanted) couplings.
+If the lower-level project did not intend to expose some internal details, the unwanted couplings introduce technical debt and make it harder to modify the lower-level project.
+One form of technical debt is the visibility debt or the cost of back-fitting visibility rules onto the existing build specifications to re-establish the appropriate encapsulations.
+Another example of technical debt is dead code that can confuse the developers looking for useful APIs.
+Over-declared or underutilized dependencies can slow the build and testing of systems.
+Under-declared dependencies can make the build process brittle and make it difficult to remove over-declared dependencies.
+Potential solutions for over-declared dependencies include:
+ +Setting aside some dedicated time for fixing build rules. But this approach is not automated, and potential breakages make it harder for developers to do the right thing.
+Automatically add all the under-declared dependencies to the BUILD files. The system can raise an error if a direct dependency is missing, making it harder to do the wrong thing.
+Automation can be applied for finding/reporting the over-declared dependencies as well.
+Potential solutions for underutilized dependencies include:
+ +While it is challenging to automate fixing underutilized dependencies, automating the discovery of such dependencies is still useful.
+Highlighting dependencies with high cost and low removal effort could incentivize developers to clean up their projects.
+Zombie targets can be identified by querying the results of build and test runs.
+A target is marked as “dead” if the attempts to build it have failed for at least 90 days. Until then, build errors are considered to be transient.
+A zombie target can be eliminated by deleting its definition from the BUILD and deleting the source files, which are reachable only via the zombie target.
+Originally, the default visibility of all the targets was public, leading to unintended dependencies.
+The visibility of all the existing builds was set to legacy_public, and the default visibility was changed to private.
+This encouraged developers to explicitly consider if they wanted other projects to depend on their project.
+Google developed its command-line parsing utilities and defined a set of recognized command-line flags for libraries and binaries.
+Over time, the number of flags grew to half a million, and many of these flags are not useful anymore (i.e., dead).
+These dead flags can make it hard to understand and refactor code.
+Existing flags were analyzed to check which ones had always been set to the same value; such flags were replaced by their constant values, clearing about 150 thousand flags.
+Removing dead flags also helps to clean up dead/unreachable code.
+The paper describes the architecture of an erstwhile single-sign-on (SSO) service used by Google, called Google Accounts (2006).
+Note that some of the metrics and design decisions may be outdated now (the paper was written in 2006). However, the core message is still relevant.
+SSO’s availability affects the availability of all applications that require user sign-in.
+Generally, systems can achieve high availability by sacrificing consistency, but given the nature of SSO (matching username/passwords), providing an inconsistent view is not a good option, and single-copy consistency is a usability requirement.
+Berkeley DB is an embedded, high-performance, scalable, transactional storage system for key-value data and provides both keyed and sequential lookup.
+It provides a primary copy replication model with a single writer (called master) and multiple read-only replicas.
+All writes are sent to the master, which first applies the changes and then propagates them to the replicas.
+The master and the replicas have identical logs, and in case of master failure, a new master is elected from the replicas.
+Some synchronization may be needed between the replicas in case, e.g., the master dies in between a transaction.
+SSO service maps usernames to user account data and services to service-specific data.
+The SSO database is partitioned into shards, where each shard is a replicated Berkeley DB (having 5 to 15 replicas).
+Each replica stores the data in a B+-link tree data structure.
+Consistent reads must go to the master, while non-master replicas can serve “stale” reads.
+In the case of larger replication groups (say 15 replicas), only a subset of replicas can become master (“electable replicas”).
+In general, replicas are spread geographically to handle machine-failure, network-failure, and data center-failure.
+Replicas in a shard are kept close to reduce the communication latency, which affects the time to commit a write operation or elect a new master.
+Some of the shards implement an ID-map, i.e., a map from username to userid and from userid to shards.
+SSO chooses a quorum protocol that guarantees that updates are never lost.
+For the write queries, the master waits for a positive acknowledgment from a majority of the replicas, including itself, before marking the query as completed.
+When selecting a new leader, SSO requires a majority of replicas to agree. Moreover, Berkeley DB elections always choose a replica with the latest log entry during an election, thus guaranteeing that the new master’s log will include all the previous master’s updates.
+The master holds a master lease when responding to read queries and refreshes this lease periodically by communicating with a majority of replicas.
+The lease guarantees that the master is not returning stale data if a partition or failure causes the master to lose its mastership, i.e., holding the lease guarantees that the master is still the master.
+Moreover, elections can not be completed within the lease timeout interval.
+SSO maintains a replica configuration containing the logical (DNS) name and IP address of each replica.
+In case of any changes to the configuration, the changes are specified in a file that the master reads periodically.
+If the configuration changes, the master initiates a configuration change and updates the database.
+Non-master replicas can get the new configuration from the database.
+A new replica or a replica that lost state (say due to a failure) starts as a non-voting replica and can not participate in an election till it has caught up with the master as of the time the replica joined (again).
+The paper shows that Siamese networks can be used for unsupervised learning with images without needing techniques like negative sample pairs, large batch training, or momentum encoders. The training mechanism is referred to as the SimSiam method.
+Given an input image x, create two augmented views x1 and x2.
+These views are processed by an encoder network f.
+One of the views (say x1) is processed by the encoder f as well as a predictor MLP h to obtain a projection p1, i.e., p1 = h(f(x1)).
+The second view (x2) is processed only by the encoder f to obtain an encoding z2 i.e., z2 = f(x2).
+Negative cosine similarity is minimized between p1 and z2 with the catch that the resulting gradients are not used to update the encoder via z2. I.e., Loss = D(p1, stopgrad(z2)) where D is the negative cosine similarity and stopgrad is an operation that stops the flow of gradients.
+In practice, both the (p1, z2) and (p2, z1) pairs are used for computing the loss, i.e., Loss = 0.5 * (D(p1, stopgrad(z2)) + D(p2, stopgrad(z1))).
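+This is essentially the pseudocode from the paper; a PyTorch version with the stop-gradient implemented via detach():
+```python
+import torch.nn.functional as F
+
+def D(p, z):
+    """Negative cosine similarity; z is detached so that no gradients
+    flow back into the encoder through this branch."""
+    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()
+
+def simsiam_loss(f, h, x1, x2):
+    z1, z2 = f(x1), f(x2)  # encoder on both augmented views
+    p1, p2 = h(z1), h(z2)  # predictor on both encodings
+    return 0.5 * (D(p1, z2) + D(p2, z1))
+```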
+Encoder uses batch norm in all the layers (including output) while projection MLP uses batch norm only in the hidden layers.
+SGD optimizer with learning rate as 0.05 * batchsize / 256, cosine learning rate decay schedule and SGD momentum = 0.9.
+Unsupervised pretraining on the ImageNet dataset followed by training a supervised linear classifier on the frozen representations.
+Stop-gradient operation is necessary to avoid a degenerate solution. Without stop-gradient, the model maps all inputs to a constant z.
+If the projection layer is removed, the method does not work (because of the loss’s symmetric nature). If the loss is also made asymmetric, the method still does not work without the projection layer. However, asymmetric loss + projection layer works.
+Keeping the projection layer fixed (i.e., not updating during training) avoids collapse but leads to poor validation performance.
+Training the projection layer with a constant learning rate works better in practice, likely because the projection layer needs to keep adapting before the encoder layer is sufficiently trained.
+The method works well across different batch sizes.
+Removing batch norm layers from all the layers in all the networks does not lead to collapse, though the model’s performance degrades on the validation dataset. Adding batch norm to the hidden layers alone is sufficient.
+Adding batch norm to the encoder’s output further improves the performance but adding batch norm to all the layers of all the networks makes the training unstable, with the loss oscillating.
+Overall, while batch norm helps to improve performance, it is not sufficient to avoid collapse.
+The setup does not collapse when the cross-entropy loss replaces the cosine loss.
+Given that the stop-gradient operation seems to be the critical ingredient for avoiding collapse, the paper hypothesizes that SimSiam is solving a different optimization problem.
+The hypothesis is that SimSiam is implementing an Expectation-Maximisation (EM) algorithm with two sets of variables and two underlying sub-problems.
+The paper performs several experiments to test this hypothesis. For example, they consider k SGD steps for the first problem before performing an update for the second problem, showing that the alternating optimization is a valid formulation, of which SimSiam is a particular case.
+SimSiam achieves the highest accuracy among SimCLR, MoCo, BYOL, and SwAV for training under 100 epochs. However, it lags behind other methods when trained longer.
+SimSiam’s representations are transferable beyond the ImageNet tasks.
+Adding projection layer and stop-gradient operator to SimCLR does not improve its performance.
+CAP theorem has been influential in the design decisions for distributed databases.
+However, designers incorrectly assume that the CAP theorem “always” imposes restrictions in terms of the tradeoff between availability and consistency. In contrast, the tradeoff is applicable only in the case of partitions.
+CAP theorem led to the development of highly available systems with reduced consistency models (and reduced ACID guarantees).
+Another tradeoff - between latency and consistency - has also been influential for database design.
+The paper unifies CAP and latency-consistency tradeoffs into a single formulation called PACELC.
+Note that some of the observations, especially ones about the databases, may be outdated now (the paper was written in 2012). However, the core message is still relevant.
+Low latency (or high availability) means that the system must replicate data.
+In case of an update query, three possibilities arise:
+ +The system can choose to send data updates to all the replicas at once. This leads to two possibilities:
+ +A replica can receive the update queries in an arbitrary order, thus breaking consistency with other replicas.
+Alternatively, the replicas could use some protocol to agree on the order of updates. However, this can introduce latency.
+The update queries can be first sent to a master replica.
+ +The master replica can apply the updates and send them to the other replicas using one of the following strategies:
+ +Synchronous replication, where the master waits until the update has been applied to the replica(s). However, this approach introduces latency.
+Asynchronous replication, where the master considers the update complete before it has been applied at the replicas. In this case, the latency-consistency tradeoff depends on how read queries are handled:
+ +The system can send all read queries to the master. In this case, there are no consistency issues, but additional latency is introduced because all the read queries go to the same replica, thus potentially overloading it.
+Alternatively, the read query can be served from any replica. While this improves read latency, the results can be inconsistent now.
+Use a mix of synchronous and asynchronous replication, i.e., some of the write queries are synchronous, and others are asynchronous. In this case, the latency-consistency tradeoff depends on how read queries are handled:
+ +If the read is routed to at least one replica that has been synchronously updated, consistency can be preserved, with additional latency for discovering the updated replica.
+If the read query can not be routed to an updated replica (maybe because none of the replicas is updated), then either latency suffers or inconsistent read can be performed.
+The update query is first sent to an arbitrary replica.
+ +In a nutshell, the tradeoff between latency and consistency is always present, irrespective of network failure.
+This contrasts with the CAP theorem, which imposes the tradeoff between availability and consistency only in the case of a network partition.
+If there is a partition (P), how does the system tradeoff availability (A) and consistency (C); else (E), when the system is running without failures, how does the system tradeoff latency (L) and consistency (C)?
+The latency-consistency tradeoff (ELC) is relevant only when the data is replicated.
+Default versions of Dynamo, Cassandra, and Riak were PA/EL systems, i.e., if a partition occurs, availability is prioritized. In the absence of partition, lower latency is prioritized.
+Fully ACID systems (VoltDB, H-Store, and Megastore) and others like BigTable and HBase are PC/EC, i.e., they prioritize consistency and give up availability and latency.
+MongoDB can be classified as a PA/EC system, while PNUTS is a PC/EL system.
+The CAP theorem states that any system sharing data over the network can only have at most two (out of three) desirable properties:
+ +consistency (C), i.e., a single, up-to-date copy of the data;
+high availability (A) of that data (for updates); and
+tolerance to network partitions (P).
+This “2 of 3” formulation is misleading as it oversimplifies the interplay between properties.
+ACID is a design philosophy that focuses on consistency as reflected in the traditional relational databases.
+The four properties in ACID are:
+ +Atomicity (A), i.e., the operations are atomic, and either the entire operation succeeds or none of it succeeds.
+Consistency (C), i.e., a transaction preserves all the rules. Note that the consistency in CAP is a subset of consistency in ACID.
+Isolation (I), i.e., transactions occur in isolation and do not affect each other.
+Durability (D), i.e., the transactions are durable irrespective of system failure.
+BASE is an alternate design philosophy that focuses on availability as reflected in the NoSQL databases.
+The three properties in BASE are:
+ +Basic Availability (BA), i.e., the database appears to work most of the time.
+Soft state (S), i.e., the system’s state can change over time as it becomes eventually consistent.
+Eventual consistency (E), i.e., the system will eventually become consistent over time.
+Generally, partitionability is seen as a must-have, thus reducing the choice to be between availability and consistency.
+This view is somewhat misleading because the choice between C, A, and P is not binary but granular.
+The choice between C and A can occur at various granularity levels, and different components (of a larger system) can prioritize different aspects.
+Similarly, the CAP theorem generally ignores latency even though it is closely related to partitionability. For example, failing to achieve consistency within a time-bound (i.e., latency) implies a partition.
+In general, there is no global notion of partition - some subset of nodes may experience a partition, and others may not.
+Once a partition is detected, the system can then choose between C and A.
+Three-step process for managing partitions:
+ +Detect the start of a partition.
+Enter an explicit partition mode that may limit some operations.
+ +Possible strategies:
+ +Reduce availability by limiting some operations.
+Record extra information that can be used during partition recovery.
+The strategy depends on the invariants that the system should maintain.
+For example, if the invariant is that the keys (in a table) should be unique, the system could allow duplicate keys for some time and perform a de-duplication step during partition recovery.
+A counterexample is a monetary transaction (e.g., charging a credit card). In such cases, the system could disable the operation and record it for performing later. Sometimes this “unavailability” is not visible to the user.
+The history of operations (over replicas across different partitions) can be tracked using version vectors of the form (node, logical time). The system can easily recreate the order in which the operations were executed (or mark them as concurrent).
+Initiate partition recovery when communication is restored and make the state across the partitions consistent.
+One common approach is to revert to the state when the partition was detected and apply the operations consistently across all the replicas.
+This may require some extra effort to merge conflicts.
+One workaround can be to constrain the use of certain operations so that the system does not encounter merge conflicts during recovery.
+Sometimes, certain invariants may be violated when the system is in the partition mode and needs to be fixed during recovery.
+The key takeaway is that when partitions exist, the choice between availability and consistency is not binary, and both can be optimized for.
+The paper describes a method to explain/interpret the representations learned by individual neurons in deep neural networks.
+The explanations are generated by searching for logical forms defined by a set of composition operators (like OR, AND, NOT) over primitive concepts (like water).
+Given a neural network f, the goal is to explain a neuron’s behavior (of this network) in human-understandable terms.
+Previous work builds on the idea that a good explanation is a description that identifies the inputs for which the neuron activates.
+Given a set of pre-defined atomic concepts $c \in C$ and a similarity measure $\delta(n, c)$ where $n$ represents the activation of the $n^{th}$ neuron, the explanation, for the $n^{th}$ neuron, is the concept most similar to $n$.
+For images, a concept could be represented as an image segmentation map. For example, the water concept can be represented by the segments of the images that show water.
+The similarity can be measured by first thresholding the neuron activations (to get a neuron mask) and then computing the IoU score (or Jaccard Similarity) between the neuron mask and the concept.
+One limitation of this approach is that the explanations are restricted to pre-defined concepts.
+The paper expands the set of candidate concepts by considering logical forms over the atomic concepts.
+In theory, the search space explodes exponentially with the number of atomic concepts. In practice, it is restricted to explanations with at most $N$ atomic concepts, and beam search is performed (instead of exhaustive search).
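+A sketch of scoring a composed concept against a neuron, treating atomic concepts and the thresholded neuron as boolean masks over inputs (toy random data stands in for real segmentation maps):
+```python
+import numpy as np
+
+def iou(neuron_mask, concept_mask):
+    """Jaccard similarity between a neuron mask and a concept mask."""
+    intersection = np.logical_and(neuron_mask, concept_mask).sum()
+    union = np.logical_or(neuron_mask, concept_mask).sum()
+    return intersection / max(union, 1)
+
+# Toy data: binary masks over 1000 inputs.
+rng = np.random.default_rng(0)
+water, river, blue = (rng.random(1000) < 0.2 for _ in range(3))
+neuron_mask = rng.random(1000) > 0.8  # thresholded neuron activations
+
+# A logical form is just a boolean formula over atomic concept masks,
+# e.g. "(water OR river) AND NOT blue":
+composed = (water | river) & ~blue
+print(iou(neuron_mask, composed))
+```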
+Image Classification Setup
+ +Neurons from the final 512-unit convolutional layer of a ResNet-18 trained on the Places365 dataset.
+Probing for concepts from ADE20k scenes dataset with atomic concepts defined by annotations in the Broden dataset
+NLI Setup
+ +BiLSTM baseline followed by MLP layers trained on Stanford Natural Language Inference (SNLI) corpus.
+Probing the penultimate hidden layer (of the MLP component) for sentence-level explanations.
+Concepts are created using the 2000 most common words in the validation split of the SNLI dataset.
+Additional concepts are created based on the lexical overlap between premise and hypothesis.
+Image Classification Setup
+ +As $N$ increases, the mean IoU increases (i.e., the explanation quality increases), though the returns diminish beyond $N=10$.
+Manual inspection of 128 neurons and their length 10 explanations show that 69% neurons learned some meaningful combination of concepts, while 31% learned some unrelated concepts.
+The meaningful combination of concepts include:
+ +perceptual abstraction that is also lexically coherent (e.g., “skyscraper OR lighthouse OR water tower”).
+perceptual abstraction that is not lexically coherent (e.g., “cradle OR autobus OR fire escape”).
+specialized abstraction of the form L1 AND NOT L2 (e.g. (water OR river) AND NOT blue).
+NLI Setup
+ +As $N$ increases, the mean IoU increases (as in the image classification setup) though the IoU keeps increasing past $N=30$.
+Many neurons correspond to lexical features. For example, some neurons are gender-sensitive or activate for verbs like sitting, eating or sleeping. Some neurons are activated when the lexical overlap between premise and hypothesis is high.
+In the image classification setup, the more interpretable the neuron is, the more accurate the model is (when the neuron is active).
+However, the opposite trend is seen in NLI models, i.e., the more interpretable neurons are less accurate.
+Key takeaway - interpretability (as measured by the paper) is not correlated with performance. Given a concept space, the identified behaviors may be correlated or anti-correlated with the model’s performance.
+The idea is to construct examples that activate (or inhibit) certain neurons, causing a change in the model’s predictions.
+These adversarial examples are referred to as “copy-paste” adversarial examples.
+For example, the neuron corresponding to “(water OR river) AND (NOT blue)” is a major contributor for detecting “swimming hole” classes. An adversarial example is created by making the water blue. This prompts the model to predict “grotto” instead of “swimming hole.”
+Similarly, in the NLI model, a neuron detects the word “nobody” in the hypothesis as highly indicative of contradiction. An adversarial example can be created by adding the word “nobody” to the hypothesis, prompting the model to predict contradiction while the true label should be neutral.
+These observations support the hypothesis that one can use explanations to create adversarial examples.
+The paper introduces GPipe, a pipeline parallelism library for scaling networks that can be expressed as a sequence of layers.
+Consider training a deep neural network with L layers using K accelerators (say GPUs).
+The ith layer has a forward function fi, a backward function bi, weights wi, and a cost ci (say, the memory footprint or computational time).
+GPipe partitions this network into K cells and places the ith cell on the ith accelerator. Output from the ith accelerator is passed to the i+1th accelerator as input.
+During the forward pass, the input batch (of size N) is divided into M equal micro-batches. These micro-batches are pipelined through the K accelerators one after another.
+During the backward pass, gradients are computed for each micro-batch. The gradients are accumulated and applied at the end of each minibatch.
+In batch normalization, the statistics are computed over each micro-batch (used during training) and mini-batch (used during evaluation).
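+A minimal single-device sketch of the micro-batching scheme (the library additionally pipelines the micro-batches across the K accelerators):
+```python
+def train_step(model, loss_fn, optimizer, x, y, num_micro_batches):
+    """Split a mini-batch into micro-batches, accumulate gradients,
+    and apply a single update at the end of the mini-batch."""
+    optimizer.zero_grad()
+    for xm, ym in zip(x.chunk(num_micro_batches), y.chunk(num_micro_batches)):
+        loss = loss_fn(model(xm), ym) / num_micro_batches
+        loss.backward()  # gradients accumulate across micro-batches
+    optimizer.step()
+```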
+Micro-batching improves over the naive model parallelism approach by reducing the underutilization of resources (due to the network’s sequential dependencies).
+GPipe supports re-materialization (or checkpointing), i.e., during the forward pass, only the output activations (at partition boundaries) are stored.
+During backward pass, the forward function is recomputed at each accelerator. This trades off the memory requirement with increased time.
+One potential downside is that partitioning can introduce some idle time per accelerator (referred to as the bubble overhead). However, with a sufficiently large number of micro-batches (more than 4 times the number of partitions), the bubble overhead is negligible.
+Two different types of model architectures are compared: AmoebaNet convolutional model and Transformer sequence-to-sequence model.
+For AmoebaNet, the size of the largest trainable model (on a single 8GB Cloud TPU v2) increases from 82M to 318M. Further, a 1.8 billion parameter model can be trained on 8 accelerators (25x improvement in size using GPipe).
+For transformers, GPipe scales the model size to 83.9 B parameters with 128 partitions (298x improvement in size compared to a single accelerator).
+Since the computation is evenly distributed across transformer layers, the training throughput scales almost linearly with the number of devices.
+Quantitative experiments on ImageNet and multilingual machine translation show that models can be effectively trained using GPipe.
+The paper proposes to use Energy-based Models (EBMs) for Continual Learning.
+In classification tasks, the standard approach uses a cross-entropy objective function along with a normalized probability distribution.
+However, cross-entropy reduces all negative classes’ likelihood when updating the model for a given sample, potentially leading to catastrophic forgetting.
+Classification can be seen as learning an EBM across separate classes.
+During an update, the energy for the pair of a sample and its ground-truth class decreases, while the energy corresponding to pairs of the sample and negative classes increases.
+Unlike the cross-entropy loss, EBMs allow choosing the negative classes to update.
+EBMs can be used for class-incremental learning without requiring a replay-buffer or generative model for replay.
+EBMs can be used for continual learning in setups without task boundaries, i.e., setups where the data distribution can change without a clear separation between tasks.
+Boltzmann distribution is used to define the conditional likelihood of label $y$ given an input $x$, i.e., $p(y|x) = \frac{\exp(-E(x, y))}{Z(x)}$ where $Z(x) = \sum_{y' \in Y} \exp(-E(x, y'))$. Here $E$ is the learnt energy function that maps an input-label pair to a scalar energy value.
+During training, the contrastive divergence loss is used.
+During inference, the class for which the input-class pair has the least energy is selected as the predicted class.
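+A sketch of inference and the conditional log-likelihood, assuming energy(x, y) returns a scalar tensor for an input-label pair:
+```python
+import torch
+
+def predict(energy, x, classes):
+    """Inference: pick the class whose input-class pair has the least energy."""
+    energies = torch.stack([energy(x, y) for y in classes])
+    return classes[energies.argmin().item()]
+
+def log_likelihood(energy, x, y, classes):
+    """log p(y|x) under the Boltzmann distribution defined above."""
+    energies = torch.stack([energy(x, c) for c in classes])
+    return -energy(x, y) - torch.logsumexp(-energies, dim=0)
+```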
+The paper considers several strategies for the selection of negative samples:
+ +one negative class per sample. The negative class is sampled from the current batch of data. This selection approach performs best.
+all the negative classes in a batch are used for creating the negative samples.
+all the classes seen so far in training are used as the negative samples. This approach works the worst in practice.
+Given the flexibility of sampling the negative classes, EBMs can be used in the boundary-agnostic setups (where the data distribution can change smoothly without an explicit task boundary).
+EBMs take both the sample and the class as the input. The class can be treated as an attention filter to select the most relevant information between the sample and the class.
+In theory, EBMs can train for any number of classes without knowing the number of classes beforehand. This is an advantage over the softmax-based approaches, where adding new classes requires changing the size of the softmax output layer.
+Split MNIST
+Permuted MNIST
+CIFAR-10
+CIFAR-100
+The proposed approach outperforms the standard continual learning approaches that use neither a replay buffer nor a generative model.
+Additionally, the paper shows that for the same number of parameters, the effective capacity of EBM models is higher than the effective capacity of standard classification models.
+The paper also shows that standard classification models tend to assign a high probability to new classes for both old and new data. EBMs assign the probability more uniformly (and correctly) across the classes.
+In an ablation study, the paper shows that both label conditioning and contrastive divergence loss help in improving the performance of EBMs.
+The paper explores HyperNetworks. The idea is to use one network (HyperNetwork) to generate the weights for another network.
+Consider a $D$ layer CNN where the parameters for the $j^{th}$ layer are stored in a matrix $K^j$ of the shape $N_{in}f_{size} \times N_{out}f_{size}$.
+The HyperNetwork is implemented as a two-layer linear network where the input is a layer embedding $z^j$, and the output is $K^j$.
+The first layer (of the HyperNetwork) maps the input to $N_{in}$ different outputs using $N_{in}$ weight matrices.
+The second layer maps the different $N_{in}$ inputs to $K_{i}$ using a shared matrix. The resulting $N_{in}$ (number of) $K_{i}$ matrices are concatenated to obtain $K^j$.
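+A sketch of this two-layer construction, assuming a single layer embedding as input (initialization scales and names are illustrative):
+```python
+import torch
+import torch.nn as nn
+
+class HyperNetworkSketch(nn.Module):
+    """Maps a layer embedding z_j to a kernel of shape
+    [N_in * f_size, N_out * f_size] via two linear layers."""
+    def __init__(self, z_dim, d, n_in, n_out, f_size):
+        super().__init__()
+        self.n_in, self.n_out, self.f_size = n_in, n_out, f_size
+        # First layer: N_in separate projections of the embedding.
+        self.w1 = nn.Parameter(torch.randn(n_in, d, z_dim) * 0.01)
+        self.b1 = nn.Parameter(torch.zeros(n_in, d))
+        # Second layer: one shared projection producing a kernel slice.
+        self.w2 = nn.Parameter(torch.randn(f_size * n_out * f_size, d) * 0.01)
+        self.b2 = nn.Parameter(torch.zeros(f_size * n_out * f_size))
+
+    def forward(self, z):
+        a = torch.einsum("idz,z->id", self.w1, z) + self.b1  # [n_in, d]
+        slices = a @ self.w2.T + self.b2  # [n_in, f_size * n_out * f_size]
+        # Concatenate the N_in slices into the full kernel matrix K^j.
+        return slices.view(self.n_in * self.f_size, self.n_out * self.f_size)
+```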
+As a side note, a HyperNetwork has far fewer parameters than the network for which it produces weights.
+In a general case, the kernel dimensions (across layers) are not of the same size but integer multiples of some basic sizes. In that case, the HyperNetwork can generate kernels for the basic size, which can be concatenated to form larger kernels. This would require additional input embeddings but not require a change in the architecture of HyperNetwork.
+HyperRNNs/HyperLSTMs denote HyperNetworks that generate weights for RNNs/LSTMs.
+HyperRNNs implement a form of relaxed weight sharing - an alternative to the full weight sharing of the traditional RNNs.
+At any timestep $t$, the input to the HyperRNN is the concatenation of the vector $x_{t}$ (input to the RNN at time $t$) and the hidden state $h_{t-1}$ of the RNN. The output is the weights for the main RNN at timestep $t$.
+In practice, a weight scaling vector $d$ is used to reduce the memory footprint, which would otherwise be $dim$ times the memory of a standard RNN, where $dim$ is the dimensionality of the embedding vector $z_j$.
+HyperNetworks are used to train standard CNNs for MNIST and ResNets for CIFAR-10. In these experiments, HyperNetworks slightly underperform the best performing models but use far fewer parameters.
+HyperLSTMs trained on the Penn Treebank dataset and Hutter Prize Wikipedia dataset outperform stacked LSTMs and perform similarly to layer-norm LSTMs. Interestingly, combining HyperLSTMs with layer-norm improves performance over HyperLSTMs alone.
+Given the similar performance of HyperLSTMs and layer-norm LSTMs, the paper conducted an ablation study to understand if HyperLSTMs learned a weight adjustment policy similar to the statistics-based approach used by layer-norm LSTMs.
+ +HyperLSTMs are also evaluated for handwriting sequence generation by training on the IAM online handwriting dataset.
+ +On the WMT’14 En-to-Fr machine translation task, HyperLSTMs outperform LSTM based approaches.
+The paper introduces HYPTER - a framework for zero-shot learning (ZSL) in text-to-text transformer models by training a HyperNetwork to generate task-specific adapters from task descriptions.
+The focus is on in-task zero-shot learning (e.g., learning to predict an unseen class or relation) and not on cross-task learning (e.g., training on sentiment analysis and evaluating on question-answering task).
+Task - an NLP task, like classification or question answering.
+Sub-task
+ +A class/relation/question within a task.
+Denoted by a tuple $(d, D)$ where $d$ is the language description while $D$ represents the subtask’s dataset.
+HYPTER has two main parts:
+ +Main network
+ +A pretrained text-to-text network
+Instantiated as a BERT-Base/Large
+HyperNetwork
+ +HyperNetwork has two parts:
+ +Encoder
+ +Encodes the task description
+Instantiated as a RoBERTa-Base model
+Decoder
+ +Decodes the encoding into weights for multiple adapters (in parallel)
+Instantiated as a Feedforward Network
+The model trains in two phases:
+ +Main network is trained on all the data by concatenating the task description with the input.
+Adapters are trained by sampling a task from the train set while keeping the main network frozen.
+The paper proposes the use of task-conditioned HyperNetworks for lifelong learning / continual learning setups.
+The idea is that the HyperNetwork would only need to remember the task-conditioned weights and not the input-output mapping for all the data points.
+$f$ denotes the network for the given $t^{th}$ task.
+$h$ denotes the HyperNetwork that generates the weights for $f$.
+$\Theta_{h}$ denotes the parameters of $h$.
+$e^{t}$ denotes the input task-embedding for the $t^{th}$ task.
+When training on the $t^{th}$ task, the HyperNetwork generates the weights for the network $f$.
+The current task loss is computed using the generated weights, and the candidate weight update ($\Delta \Theta_{h}$) is computed for $h$.
+The actual parameter update is computed by minimizing the following loss (a code sketch follows the symbol definitions below):
+$L_{total} = L_{task}(\Theta_{h}, e^{T}, X^{T}, Y^{T}) + \frac{\beta_{output}}{T-1} \sum_{t=1}^{T-1} \| f_{h}(e^{t}, \Theta_{h}^{*}) - f_{h}(e^{t}, \Theta_{h} + \Delta \Theta_{h}) \|^2$
+ +$L_{task}$ is the loss for the current task.
+$(X^{T}, Y^{T})$ denotes the training datapoints for the $T^{th}$ task.
+$\beta_{output}$ is a hyperparameter to control the regularizer’s strength.
+$\Theta_{h}^*$ denotes the optimal parameters after training on the $T-1$ tasks.
+$\Theta_{h} + \Delta \Theta_{h}$ denotes the one-step update on the current $h$ model.
+In practice, the task encoding $e^{t}$ is chunked into smaller vectors, and these vectors are fed as input to the HyperNetwork.
+This enables the HyperNetwork to produce weights iteratively, instead of all at once, thus helping to scale to larger models.
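+A minimal sketch of chunked weight generation, assuming learned chunk embeddings concatenated with the task embedding (names and shapes are illustrative, not the authors' code):
+```python
+import torch
+import torch.nn as nn
+
+class ChunkedHyperNetwork(nn.Module):
+    """Produces the target network's flattened weights in fixed-size
+    chunks: the same small network is reused for every chunk, conditioned
+    on a learned chunk embedding c_i plus the task embedding e_t."""
+
+    def __init__(self, task_dim, chunk_dim, chunk_size, n_chunks, hidden=64):
+        super().__init__()
+        self.chunk_embeddings = nn.Parameter(torch.randn(n_chunks, chunk_dim))
+        self.net = nn.Sequential(
+            nn.Linear(task_dim + chunk_dim, hidden),
+            nn.ReLU(),
+            nn.Linear(hidden, chunk_size),
+        )
+
+    def forward(self, task_embedding):
+        # generate weights chunk by chunk, then concatenate
+        chunks = [self.net(torch.cat([task_embedding, c]))
+                  for c in self.chunk_embeddings]
+        return torch.cat(chunks)  # flattened weights for the target network f
+```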
+The paper also considers the problem of inferring the task embedding from a given input pattern.
+Specifically, the paper uses task-dependent uncertainty, where the task embedding with the least predictive uncertainty is chosen as the task embedding for the given unknown task. This approach is referred to as HNET+ENT.
+The paper also considers using HyperNetworks to learn the weights for a task-specific generative model. This generative model will be used to generate pseudo samples for rehearsal-based approaches. The paper considers two cases:
+ +HNET+R where the replay model (i.e., the generative model) is parameterized using a HyperNetwork.
+HNET+TIR, where an auxiliary task inference classifier is used to predict the task identity.
+Three setups are considered
+ +CL1 - Task identity is given to the model.
+CL2 - Task identity is not given, but task-specific heads are used.
+CL3 - Task identity needs to be explicitly inferred.
+On the permuted MNIST task, the proposed approach outperforms baselines like Synaptic Intelligence and Online EWC, and the performance gap is more significant for larger task sequences.
+Forward knowledge transfer is observed with the CIFAR datasets.
+One potential limitation (which is more of a limitation of HyperNetworks) is that HyperNetworks may be harder to scale for larger models like ResNet50 or transformers, thus limiting their usefulness for lifelong learning use cases.
+The paper systematically investigates when curriculum learning helps.
+Implicit curricula refer to the order in which a network learns data points when trained using stochastic gradient descent with i.i.d. sampling of data.
+When training, suppose the model first makes a correct prediction for a given datapoint in the $i^{th}$ epoch (and keeps predicting it correctly in all subsequent epochs). The $i^{th}$ epoch is then referred to as the learned iteration of the datapoint (the iteration in which the datapoint was learned).
+The paper studied multiple models (VGG, ResNet, WideResNet, DenseNet, and EfficientNet) with different optimizers (Adam and SGD with momentum).
+The resulting implicit curricula are broadly consistent within the model families, making the following discussion less dependent on the model architecture.
+A scoring function maps a data point to a numerical score of difficulty.
+Choices:
+ +Loss function for a model
+learned iteration
+Estimated c-score - captures how consistently a given model correctly predicts a datapoint’s label when trained on an i.i.d. dataset that does not contain the datapoint.
+The three scoring functions are computed for two models on the CIFAR dataset.
+The resulting six scores have a high Spearman Rank correlation. Hence for the rest of the discussion, only the c-score is used.
+The pacing function, denoted by $g(t)$, controls the size of the training dataset at step $t$.
+At step $t$, the model would be trained on the first $g(t)$ examples (as per the ordering).
+Choices: logarithmic, exponential, step, linear, quadratic, and root.
+Order in which the data points are picked:
+ +Curriculum - Ordering points from lowest score to highest and training on the easiest data points first.
+Anti Curriculum - Ordering points from highest score to lowest and training on the hardest data points first.
+Random - Randomly selecting the data points to train on.
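+A minimal sketch combining a difficulty ordering with a pacing function (the root pacing default and function names are illustrative, not the paper's code):
+```python
+import numpy as np
+
+def curriculum_batches(scores, n_steps, batch_size, order="curriculum",
+                       pacing=lambda t, T, n: int(n * (t / T) ** 0.5)):
+    """scores: per-example difficulty (e.g., c-scores); pacing(t, T, n)
+    returns how many of the ordered examples are available at step t."""
+    n = len(scores)
+    if order == "curriculum":
+        idx = np.argsort(scores)        # easiest first
+    elif order == "anti-curriculum":
+        idx = np.argsort(-scores)       # hardest first
+    else:
+        idx = np.random.permutation(n)  # random ordering
+    for t in range(1, n_steps + 1):
+        avail = idx[: max(batch_size, pacing(t, n_steps, n))]
+        yield np.random.choice(avail, size=batch_size, replace=False)
+```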
+The paper performed a hyperparameter sweep over 180 pacing functions and three orderings, with three random seeds, over the CIFAR10 and CIFAR100 datasets. For both datasets, the best performance is obtained with random ordering, indicating that curricula did not give any benefits.
+However, curricula are useful when the number of training iterations is small.
+They also help when training on noisy data (simulated by randomly permuting the labels).
+The observations for the smaller CIFAR10/100 dataset generalize to slightly larger datasets like FOOD101 and FOOD101N.
+The paper studies the effect of catastrophic forgetting on representations in neural networks.
+Techniques:
+ +Representational Similarity Measures
+Layer Freezing
+Layer Reset
+Datasets
+ +Split CIFAR-10
+ +CIFAR-10 dataset is split into m (=2) tasks, where each task is an n-way classification task.
+The underlying network has a shared trunk with m heads, one head per task.
+Split CIFAR-100 Distribution Shift
+ +Network Architecture
+ +Are all representations (throughout the network) equally responsible for forgetting?
+ +Higher layer (layers closer to the output) are the primary source of catastrophic forgetting.
+The Centered Kernel Alignment (CKA) technique is used to compare the similarity between the layer representations before and after training on the second task (a minimal sketch follows this list).
+Higher layer representations change significantly when training over two tasks while the lower layer representations remain stable.
+When finetuning on the second task, freezing the lower layers has only a minor effect on the accuracy of the second task.
+In layer reset experiments, after training on the second task, the weights of some of the layers are reset to their values after training on the first task.
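+A minimal sketch of linear CKA between two representation matrices of shape (n_examples, n_features):
+```python
+import numpy as np
+
+def linear_cka(X, Y):
+    """Linear Centered Kernel Alignment between representations X and Y."""
+    X = X - X.mean(axis=0)
+    Y = Y - Y.mean(axis=0)
+    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
+    num = np.linalg.norm(Y.T @ X, "fro") ** 2
+    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
+    return num / den
+```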
+ +Do common approaches for countering catastrophic forgetting work by stabilizing the higher layers?
+ +Yes - both EWC and replay-based approaches counter catastrophic forgetting by stabilizing the higher layers.
+This is demonstrated by showing that as the quadratic penalty for EWC (or fraction of data from replay buffer) increases (to reduce catastrophic forgetting), the representations for higher layers change less during the second task.
+When training over a sequence of tasks, are similar tasks more likely to be forgotten than different tasks?
+ +Setup I
+ +Training over a sequence of two binary classification tasks.
+Task 1: Two related classes (say ship and truck).
+Task 2: Two related classes, which may or may not be related to the classes of Task 1. For example, the classes could be:
+ +cat and horse (not related to the classes of the first task)
+plane and car (related to the classes of the first task)
+Training over semantically similar tasks (here plane and car) leads to less forgetting.
Setup II
+ +Training over a sequence of two classification tasks.
+Task 1: Four classes that can be grouped into two groups (say deer, dog, ship, and truck).
+Task 2: Two related classes, which may be related to group 1 or group 2. For example, the classes could be two animals or two objects.
+After training on the second task, the Task 1 classes that are in a different group from the Task 2 classes are forgotten less.
+Conclusion
+ +Task representational similarity is a function of both underlying data and optimization procedure.
+Forgetting is most severe for task representations of intermediate similarity.
+Representational similarity is necessary but not a sufficient condition for forgetting.
+How does catastrophic forgetting change as the task similarity changes?
+ +If the model learns different representations for dissimilar tasks, increasing dissimilarity can help to avoid forgetting.
+When training the two-task, two-class (per task) CIFAR-10 setup with an added “others” class (made up of classes not already used in the setup), forgetting is reduced.
+The paper presents case studies from the experience of deploying an ad click-through rate (CTR) prediction model at Google.
+The paper focuses on themes related to memory footprint, performance analysis, calibration, confidence in the predictions, and feature engineering.
+Features (corresponding to a given ad) include search query and the metadata in the ad. The features are very sparse.
+A single-layer, regularized logistic regression model is trained with Online Gradient Descent (same as Stochastic Gradient Descent, but in the online setting).
+From a memory perspective, it is important to minimize the size of the final model.
+Adding just an L1 penalty is not sufficient to produce weights that are exactly 0.
+The “Follow The (Proximally) Regularized Leader” (FTRL-Proximal) algorithm is used instead to learn sparse models without sacrificing accuracy (a sketch of the update follows the per-coordinate learning-rate discussion below).
+Using per-coordinate learning rates improves the performance at the cost of memory as both the sum of gradients and the sum of the square of gradients are tracked for each feature.
+ +In practice, some of the cost can be alleviated by approximating that all the events containing a given feature have the same probability.
+In such a case, the sum of the square of gradients can be approximated using the counts of positive and negative events alone.
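+A minimal sketch of the per-coordinate FTRL-Proximal update for logistic regression, following the paper's pseudocode (hyperparameter values and the sparse-feature interface are illustrative):
+```python
+import numpy as np
+
+class FTRLProximal:
+    """Per-coordinate FTRL-Proximal for logistic regression. The L1 term
+    drives weights to exactly 0, giving sparse models."""
+
+    def __init__(self, dim, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
+        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
+        self.z = np.zeros(dim)  # accumulated adjusted gradients
+        self.n = np.zeros(dim)  # accumulated squared gradients
+
+    def weights(self, idx):
+        # closed-form per-coordinate solution; exactly 0 when |z_i| <= l1
+        z, n = self.z[idx], self.n[idx]
+        return np.where(
+            np.abs(z) <= self.l1, 0.0,
+            -(z - np.sign(z) * self.l1)
+            / ((self.beta + np.sqrt(n)) / self.alpha + self.l2))
+
+    def update(self, idx, x, y):
+        # idx: indices of the active (sparse) features; x: their values; y in {0, 1}
+        w = self.weights(idx)
+        p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
+        g = (p - y) * x
+        sigma = (np.sqrt(self.n[idx] + g * g) - np.sqrt(self.n[idx])) / self.alpha
+        self.z[idx] += g - sigma * w
+        self.n[idx] += g * g
+        return p
+```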
+Some memory overhead can be reduced based on the following observation: the vast majority of features are extremely rare. Hence, it is not necessary to track the statistics for such rare features.
+ +However, in an online setting, it is not known upfront as to which features will be sparse.
+The paper proposes to use probabilistic feature inclusion - a feature is added to the model with probability $p$. Once it is added, the feature is not removed.
+An alternative approach is to use a rolling set of counting Bloom filters to check if a feature has appeared at least $n$ times in training. Bloom filters are probabilistic data structures and can return false positives.
+Memory can also be saved by using fewer bits for encoding weights.
+ +Most of the weight coefficients lie in the range $(-2, 2)$, so a $16$-bit encoding is used in place of a $32$- or $64$-bit encoding.
+This quantization approach needs to account for roundoff problems; the fix (a simple randomized rounding strategy) is easy to implement.
+When training many models with similar hyperparameters, per-model learning rate counters can be replaced by statistics shared by all the models, thus reducing memory footprint.
+A Single Value Structure is used to reduce the memory footprint when evaluating a very large set of model variants that differ only in addition/removal of a small subset of features.
+ +All the models that use a feature share a single value structure corresponding to that feature. This reduces the memory overhead by an order of magnitude.
+During the update, each model computes the weight updates corresponding to all the features that it is using. The updated weight is averaged across all the models and used to update the single value structure.
+Since CTR datasets are generally highly imbalanced, the training data for the negative class can be subsampled to reduce the amount of data to train over. The loss component corresponding to the negative class can then be appropriately scaled up.
+Metrics
+ +Offline metrics like AucLoss (1 - AUC), Log Loss, Squared Error
+Online loss is computed on the new training data (new incoming traffic) before training on it.
+The confidence in the model’s prediction is estimated using a heuristic called the uncertainty score. It can be measured using the dot product of the feature vector and the vector of learning rates.
+ +The idea is that the learning rates already maintain a notion of uncertainty.
+Features for which the learning rate is high are the features for which uncertainty is also high.
+Calibrating Predictions
+ +The calibration can be improved by applying correction functions $\tau_d(p)$ where $p$ is the predicted CTR, and $d$ is an element of a partition of the training data.
+$\tau$ can be modeled as $\gamma p^{\kappa}$, where $\gamma$ and $\kappa$ are learned using Poisson regression.
+Unsuccessful Experiments
+ +Aggressive feature hashing was tried to reduce the memory overhead. However, it leads to a significant loss in performance.
+Using dropout did not help, probably because the features are sparse.
+Using feature bagging hurt the AucLoss.
+Feature vector normalization did not improve performance, probably because of per-coordinate learning rates and regularization.
+The paper describes several design choices for developing a model for predicting user response (clicks) on ads.
+The model is trained/evaluated on offline data.
+Evaluation metrics:
+ +Normalized Cross-Entropy (or Normalized Entropy, NE)
+ +Defined as the predictive log-loss per impression, divided by the entropy of the background CTR (click-through rate).
+Background CTR is the average empirical CTR of the training data.
+Lower normalized cross-entropy is better.
+The normalization term is important to make the metric insensitive to the background CTR. Otherwise, the log loss can easily be made low when background CTR is close to 0 or 1.
+NE can also be written as $1 - RIG$, where $RIG$ is the Relative Information Gain.
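+Concretely, with labels $y_i \in \{-1, +1\}$, predicted probabilities $p_i$, and background CTR $p$, the paper defines:
+$NE = \frac{-\frac{1}{N}\sum_{i=1}^{N}\left(\frac{1+y_i}{2}\log p_i + \frac{1-y_i}{2}\log(1-p_i)\right)}{-(p \log p + (1-p)\log(1-p))}$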
+Calibration
+Area-Under-ROC (AUC) is a good metric for measuring ranking quality (among ads). However, it does not capture calibration, which is needed to avoid over-delivery or under-delivery of ads.
+Feature Transformation
+ +A given ad impression, $e$, is transformed into an $n$-dimensional vector, $x$, where the $i^{th}$ index denotes the value of the $i^{th}$ categorical feature.
+Continuous features are binned, and the bin index is used as a categorical feature, thus applying a non-linear transformation to the features.
+Categorical features that are tuple-like (i.e., have a tuple of values) can be converted into new categorical features by taking a cartesian product.
+Boosted decision trees can be used to implement the previous two transformations in one go.
+ +Each tree is used as a categorical feature that takes the value of the index of the leaf node that an ad maps to.
+The paper used a Gradient Boosting Machine with the $L_2$-TreeBoost algorithm.
+Using the tree feature transformation improves the Normalized Cross-Entropy by $3.4\%$.
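+A minimal sketch of the tree-based transform using off-the-shelf components (the paper uses its own $L_2$-TreeBoost implementation; sklearn is an illustrative stand-in):
+```python
+import numpy as np
+from sklearn.ensemble import GradientBoostingClassifier
+from sklearn.linear_model import LogisticRegression
+from sklearn.preprocessing import OneHotEncoder
+
+X, y = np.random.rand(1000, 10), np.random.randint(0, 2, 1000)  # toy data
+
+gbdt = GradientBoostingClassifier(n_estimators=100).fit(X, y)
+# apply() returns, for each example, the leaf index reached in every tree;
+# each tree thus acts as one categorical feature.
+leaves = gbdt.apply(X)[:, :, 0]  # shape: (n_examples, n_trees)
+encoder = OneHotEncoder().fit(leaves)
+lr = LogisticRegression(max_iter=1000).fit(encoder.transform(leaves), y)
+```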
+Model
+ +Logistic Regression (LR) or Bayesian online learning scheme for probit regression (BOPR) algorithms are used for training a linear classifier model.
+While both LR and BOPR models provide similar performance, the LR model is half the BOPR model’s size and faster for performing training/inference.
+When a model is trained on the data from a particular day and evaluated on data from the subsequent days, the model’s performance degrades as the delay between training and test set increases.
+This highlights the importance of the freshness of the training data.
+One straightforward approach can be to train the model every day.
+Alternatively, the linear classifier can be trained using online learning, while the boosted decision tree can still be trained daily.
+Different choices for setting the learning rate (for online training of linear classifier) are compared, and the per-coordinate learning rate is found to perform best in practice.
+An “online joiner” system is used to generate real-time training data for the linear classifier.
+The challenging part is that while there are data points with a “positive” label (i.e., the user clicked on the ad), there are no data points with a “negative” label (since there is no “no-click” button that the user can click).
+An impression is considered to have the “no-click” label if the user does not click on the ad within a (long) time window of seeing the ad.
+Too short a time window could mislabel some impressions, while too long a time window will delay the real-time training data.
+The online joiner performs a distributed stream-to-stream join on the stream of ad impressions and stream of ad clicks using a HashQueue.
+A HashQueue:
+ +comprises a First-In-First-Out queue as a buffer window and a hash map for fast random access to label impressions.
+supports three operations on key-value pairs: enqueue, dequeue, and lookup.
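+A minimal sketch of a HashQueue, assuming in-process streams and wall-clock timestamps (the real system performs a distributed stream-to-stream join; names are illustrative):
+```python
+import time
+from collections import OrderedDict
+
+class HashQueue:
+    """FIFO buffer window plus hash map: joins an impression stream with a
+    click stream; impressions older than the window become negatives."""
+
+    def __init__(self, window_seconds):
+        self.window = window_seconds
+        self.items = OrderedDict()  # impression_id -> (timestamp, features)
+
+    def enqueue(self, impression_id, features):
+        self.items[impression_id] = (time.time(), features)
+
+    def lookup(self, impression_id):
+        return self.items.get(impression_id)
+
+    def dequeue_expired(self):
+        """Pop impressions whose window has elapsed; label them no-click."""
+        now, expired = time.time(), []
+        while self.items:
+            _, (ts, features) = next(iter(self.items.items()))
+            if now - ts < self.window:
+                break
+            self.items.popitem(last=False)
+            expired.append((features, 0))  # label 0: no click within window
+        return expired
+
+    def mark_click(self, impression_id):
+        """On a click event, emit a positive example and drop the entry."""
+        entry = self.items.pop(impression_id, None)
+        return (entry[1], 1) if entry else None
+```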
+Increasing the number of boosting trees shows diminishing returns, and most of the improvements come from the first 500 trees.
+Top 10 features account for half of the total feature importance, while the last 300 features add less than 1% feature importance.
+Features in the boosting model can be broadly classified as contextual or historical.
+Historical features provide much more explanatory power than contextual features, though contextual features are helpful for handling the cold-start problem.
+Models trained with just the contextual features rely more heavily on data freshness than models trained with just the historical features.
+Uniform subsampling and negative downsampling techniques are used to limit the amount of training data.
+In the case of negative downsampling, the model needs to be re-calibrated as well.
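+Concretely, if negatives are downsampled at rate $w$, a prediction $p$ from the downsampled model can be re-calibrated as $q = \frac{p}{p + (1-p)/w}$.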
+The paper presents some causes for (temporary) high-latency episodes in large-scale online systems and techniques to mitigate their impact so that the tail of latency distribution remains short.
+Shared resources between processes on the same node
+Background processes (daemons) could cause a momentary spike in resource usage.
+Processes running on different nodes may contend for global resources like shared file systems.
+Maintenance activities like disk compaction or garbage collection.
+Others like queueing, power limits, or energy management.
+In the case of large-scale systems, the component-level variability is further amplified.
+Use differentiated service classes to prioritize user requests over non-interactive requests.
+Reduce head-of-line blocking by breaking long-running requests into smaller requests.
+Synchronize maintenance jobs across nodes to minimize the window for high latency.
+Caching generally does not help to address tail latency.
+Two categories of adaptation approaches
+ +Within Request Short-Term Adaptations
+ +These approaches are more relevant for services that perform many read queries on loosely consistent datasets.
+Hedged Request
+ +Send the request to multiple replicas, and once one of the replicas returns the result, cancel the other requests.
+In practice, start by sending the request to only one replica. Send the secondary requests only if the first request has been outstanding for more than the $95^{th}$-percentile expected latency.
+This introduces an additional $5\%$ load while substantially shortening the latency tail.
+This approach works because, often, the cause of latency is not the query itself but other factors like overloaded nodes.
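+A minimal sketch of a hedged request, assuming async replica callables and a precomputed $95^{th}$-percentile latency (the API is illustrative):
+```python
+import asyncio
+
+async def hedged_request(replicas, payload, hedge_after):
+    """Send to one replica; if it is still outstanding after `hedge_after`
+    seconds, send a secondary request and take whichever finishes first."""
+    first = asyncio.create_task(replicas[0](payload))
+    try:
+        # shield() keeps the first request running if the timeout fires
+        return await asyncio.wait_for(asyncio.shield(first), timeout=hedge_after)
+    except asyncio.TimeoutError:
+        second = asyncio.create_task(replicas[1](payload))
+        done, pending = await asyncio.wait(
+            {first, second}, return_when=asyncio.FIRST_COMPLETED)
+        for task in pending:
+            task.cancel()  # cancel the slower request once one returns
+        return done.pop().result()
+```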
+Tied Request
+ +The hedged-request approach makes a tradeoff regarding how long to wait before initiating requests to other replicas: the sooner the secondary request is sent, the lower the latency of serving the request, but the higher the overall load on the system.
+The load on the system can be reduced by “tying” requests (sent to different replicas) so that as soon as one replica starts processing the request, it can notify the other replicas, which can drop the request or deprioritize it.
+In practice, “tying” requests means that each replica has the identity of the other replicas that may execute the request.
+Note that there is a short window (of the average network message delay) during which multiple replicas could start executing the request. This can be mitigated if the client (issuing the requests) introduces a delay of twice the average network message delay between sending the two requests.
+Submit the request to the least loaded replica
+ +Cross-Request Long-Term Adaptations
+ +These approaches are more relevant for situations where different services have different throughput.
+Micro-partitions
+ +Generate more partitions than the number of nodes.
+The partitions can be dynamically assigned to machines to ensure proper load balancing.
+In case of machine failure, many nodes can be used to quickly re-create the micro-partitions instead of waiting on one machine to read one single large partition.
+Selective Replication
+ +Latency induced probation
+ +Large Information Retrieval Systems
+ +In such systems, speed can be more critical than the quality of the result.
+The system should return a “good enough” result that is available with low latency instead of waiting for the “best result” that is available with high latency.
+In some cases, a request could trigger an unexpected code path or cause some other exception that could slow down the entire system.
+In such cases, the canary request technique can be used where the system sends the request initially to only 1 or 2 nodes. The request is sent over to the other nodes only after receiving a successful response from the initial nodes.
+Requests that update state are easier to handle for several reasons:
+ +The scale of latency-critical modifications is generally small.
+The update can be performed asynchronously after responding to the user.
+Quorum-based approaches (often used for ensuring consistent updates) are inherently tail-tolerant.
+The paper describes YouTube’s deep learning-based recommendation system.
+Scale - Very large number of users and videos.
+Freshness - Very large number of videos uploaded every hour. The recommendation system should take these new videos into account as well.
+Noise - User satisfaction needs to be modeled from noisy implicit feedback signal as the explicit signal is very sparse.
+Two neural networks: one for candidate generation and another one for ranking.
+Metrics
+ +Offline metrics like precision, recall, ranking loss
+A/B testing via live experiments
+Input: events from a user’s YouTube activity history.
+Output: small subset (hundreds) of videos.
+Approach:
+ +Recommendation is modeled as extreme multiclass classification.
+Predict the video (from a corpus) that a user will watch at a given time.
+The neural network’s task is to learn useful user embeddings, given the user’s context and history.
+For each positive class (relevant video), negative classes (non-relevant videos) are sampled from the video corpus.
+Model Architecture
+ +A feedforward network with input as user embeddings and context embeddings (watch history).
+Watch history is a variable-length sequence of video ids, where each video id is mapped to an embedding.
+The sequence of video ids is mapped to a sequence of embeddings, and this sequence is averaged to obtain a fixed-size embedding.
+Additional signals like demographic features and search query embeddings can be added along with the context embeddings.
+The age of a video is also used as a feature during training to account for the freshness of the content. This feature is set to zero (or slightly negative) during inference.
+Other Insights
+ +Training examples are generated from all YouTube watches, including the watches from the videos embedded on other sites, to surface new content.
+Generating the same number of training examples per user is important to avoid a small set of active users from dominating the model training.
+Predicting a user’s next watch leads to better results than predicting a randomly held-out watch. This can be attributed to the general consumption pattern of videos (e.g., episodes are usually watched in order).
+Input: list of candidate videos to rank from.
+Output: score for each video.
+Approach
+ +Feature representation
+ +Different types of features: categorical vs. continuous, univalent vs. multivalent, describes video vs. describes user or context.
+Important signals include user’s interaction with the video (or similar videos), which source/channel added the video to the candidate set.
+Embeddings are shared across features. For example, the representation for a video id remains the same, irrespective of whether it is being used for representing the “video to recommend” or the “last seen video.”
+Feature normalization and transformations like exponents (square or square root) for continuous variables improve the performance.
+To model the expected watch time, the logistic regression loss is weighted by the observed watch time. For example, if a video was watched, its weight is given by the observed watch time, and if the video was not watched, its weight is set to 1.
+In practice, this means that the logistic regression model learns odds that approximate the expected watch time of the video.
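+To see why: with $N$ impressions, of which $k$ are clicked with watch times $T_i$, the odds learned by the weighted model are $\frac{\sum_i T_i}{N - k} = E[T]\frac{N}{N-k} \approx E[T](1 + P(\text{click})) \approx E[T]$, since the click probability $P(\text{click})$ is small.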
+The paper studies transfer learning in RL, focusing on simultaneous transfer across both tasks and environments.
+The key idea is to learn task and environment embeddings and compose them using a meta-rule, and the proposed approach is called SYNPO (Synthesized Policies).
+Three settings considered:
+ +S1: Transfer to a new (environment, task) pair when the agent has been trained on the environment and the task before (but not simultaneously).
+S2: Transfer to a new (environment, task) pair where either the environment or the task is not seen previously.
+S3: Transfer to a new (environment, task) pair where neither the environment nor the task is seen previously.
+In the second and third settings, the agent is allowed to collect some data in the new environment or task.
+The (environment, task) combinations that the agent has seen during training are referred to as seen combinations, while the remaining combinations are referred to as the unseen combinations.
+The key idea is to:
+ +learn embeddings of environments and tasks
+use these embeddings to compose a policy (parameterized as the linear combination of the policy basis).
+A disentanglement objective is used to decouple the task and environment embedding.
+Given an (environment, task) pair $z = (\epsilon, \tau)$, the policy is given as $\pi_z(a \vert s) \propto \exp(\psi_s^T U(e_{\epsilon}, e_{\tau}) \phi_{a} + b_{\pi})$.
+Here $b_{\pi}$ is a scalar bias, $\psi_{s}$ and $\phi_{a}$ are state and action representations, and $U$ is parameterized as a linear combination of $K$ basis matrices $\Theta_k$:
+$U(e_{\epsilon}, e_{\tau}) = \sum_{k=1}^{K}\alpha_k(e_{\epsilon}, e_{\tau})\Theta_k$.
+The basis matrices (denoted by $\Theta_k$) are shared across tasks while the coefficients ($\alpha_k$) are specific to the (environment, task) pair.
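+A minimal sketch of the policy-synthesis step (shapes and names are illustrative):
+```python
+import torch
+
+def synthesize_policy_logits(psi_s, phi_a, alpha, theta, b_pi):
+    """psi_s: state representation (d_s,); phi_a: action representations
+    (n_actions, d_a); alpha: coefficients alpha_k(e_env, e_task) (K,);
+    theta: basis matrices (K, d_s, d_a); b_pi: scalar bias."""
+    U = torch.einsum("k,kij->ij", alpha, theta)  # U = sum_k alpha_k * Theta_k
+    logits = psi_s @ U @ phi_a.T + b_pi          # psi^T U phi_a + b per action
+    return logits  # softmax over these gives pi(a|s)
+```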
+During training, the agent also predicts rewards using the same set of basis but different coefficients.
+Given an (environment, task) pair, the agent is trained to decode the environment (and task) given the agent’s trajectory.
+The sequence of state-action pairs (in the trajectory) is mapped to a sequence of state-action representations, given by $\psi_s^T\Theta_k\phi_{a}$
+GRIDWORLD
+ +Twenty $16 \times 16$ grid-aligned mazes that are similar in appearance but differ in topology.
+The task is to collect colored blocks in a given order. In each task, the starting position of the agent and the position of the blocks is randomized.
+Each environment has 20 tasks, leading to a total of 400 (environment, task) combinations.
+THOR - a 3D simulator where the agent is placed in indoor photo-realistic scenes.
+The task is to search for objects and perform actions like “put cabbage on the fridge.”
+The setup uses 19 scenes (environments), with each environment comprising 21 tasks.
+Baselines include MLPs that concatenate the state, environment embedding, and task embedding.
+Multi-task Learning where the distinction between the environments is ignored.
+GRIDWORLD
+ +In the first setting (S1)
+ +SYNPO outperforms all the baselines.
+As the agent is trained on more (environment, task) combinations, its performance on the unseen combinations improves. This trend saturates when the seen/total ratio reaches about 0.4 (i.e., training on 40% of all the combinations).
+Task disentanglement is more important than environment disentanglement.
+In the second and third setting (S2 and S3)
+ +The agent uses one demonstration from each test pair to finetune the embeddings.
+S2 is an easier setting than S3.
+Transfer learning across tasks is easier than transfer learning across environments.
+THOR
+ +The paper presents Toolformer, a language model that uses simple APIs to use external tools (calculator, QA system, search engine, translation system, and calendar).
+Starting with a language model, M, the goal is to enable the language model to use tools by invoking API calls.
+An API call is denoted by the tuple $c = $ (api_name, api_input). It can be linearized as $e(c) = $ “[api_name(api_input)]”, or as $e(c, r) = $ “[api_name(api_input) -> r]”, where $r$ denotes the result of the API call.
+The given dataset of plain text, $C$, is converted into a dataset $C^*$, augmented with API calls, using a three-step process.
+In the first step, a position ($i$) and API call candidates (for the position $i$) are sampled.
+ +Positions are sampled by (i) computing the probability that M assigns to starting an API call for each position and (ii) retaining the top-$k$ positions with a probability greater than a threshold value.
+For each of the sampled positions (say $i$), API calls are sampled by concatenating a prompt to the tokens till index $i$ and sampling from the model M. Examples that do not generate the “end of the API” token (i.e.,”]”) are discarded.
+In the second step, the API calls are executed to obtain responses $r$ (text sequences).
+ +In the last step, API calls are filtered: a call is kept only if providing it (along with its result) reduces the model’s loss over the subsequent tokens, compared to making no call. The remaining API calls are merged to obtain the augmented dataset $C^*$ that is used for finetuning M.
+Note that $C^*$ contains $C$, so M is finetuned on the original dataset and examples where a tool is helpful.
+During inference, the model is used for decoding in the usual way. Decoding is stopped when it produces the “->” token, and the corresponding API is used to generate the response. The decoding process (using the model) resumes with the API output appended to the decoded text.
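+A minimal sketch of this interleaved decoding loop (`model.next_token`, `model.eos_token`, and the `tools` mapping are illustrative assumptions, not the paper's code):
+```python
+def generate_with_tools(model, tools, prompt, max_steps=100):
+    """Decode normally; when the model emits "->", run the pending API
+    call, append its result, and resume decoding."""
+    text = prompt
+    for _ in range(max_steps):
+        token = model.next_token(text)  # one regular decoding step
+        text += token
+        if token == "->":
+            # parse the call, e.g. "[Calculator(400 / 1400)->" -> name, args
+            call = text[text.rfind("[") + 1 : -len("->")].strip()
+            name, args = call.split("(", 1)
+            text += " " + str(tools[name](args.rstrip(")"))) + "]"
+        elif token == model.eos_token:
+            break
+    return text
+```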
+There are two constraints on the tools: (i) their input and output should be expressible as text, and (ii) a few demonstrations of their use can be obtained. The second constraint means that the tool should be usable and accessible.
+The paper considered the following tools: a question-answering system, a Wikipedia search engine, a calculator, a calendar, and a machine translation system. Of these, only the calculator and calendar are non-neural network tools.
+Subset of CCNet is used as the language modeling dataset.
+GPT-J is used as the language model.
+For finetuning, the batch size is 128, the learning rate is 1e-5, and a linear warmup for the first 10% of training is used.
+Following models are compared:
+ +GPT-J: Regular GPT-J model without any finetuning.
+GPT-J + CC: GPT-J finetuned on $C$ without any API calls.
+Toolformer, i.e., GPT-J finetuned on $C^*$.
+Toolformer with API calls disabled during decoding.
+OPT 66B
+GPT-3
+The models are evaluated in the prompted zero-shot setup, where models are instructed to solve a task without any in-context examples.
+One difference from the standard greedy decoding is that the API call is used whenever it is one of the top-10 most likely next tokens. This is done to increase the use of API calls.
+Evaluation Tasks
+ +SQuAD, GoogleRE, and T-REx subsets of the LAMA benchmark where the model has to complete a short statement with a missing fact.
+ +Since LAMA questions are based on Wikipedia, Toolformer isn’t allowed to use Wikipedia search.
+The evaluation criteria is to check if the correct word is among the first five words predicted by the model.
+Toolformer uses the question-answering tool for most cases, outperforming all the baselines.
+Math Dataset
+ +ASDiv, SVAMP, and MAWPS benchmarks.
+The first number predicted by the model is considered to be the output.
+Toolformer uses the calculator tool for most cases, thereby outperforming all the baselines.
+Question Answering
+ +Web Questions, Natural Questions, and TriviaQA datasets.
+The evaluation criteria is to check if the correct word is among the first 20 words predicted by the model.
+Question Answering tool is disabled for this setup.
+Toolformer uses the Wikipedia tool for most cases, thereby outperforming all the baselines other than the much larger GPT-3 model.
+Multilingual Question Answering
+ +MLQA benchmark.
+The evaluation criteria is to check if the correct word is among the first ten words predicted by the model.
+Toolformer uses the translation tool for most of the questions, with questions in Hindi being an exception.
+However, Toolformer does not consistently outperform the GPT-J baseline, likely because, for some languages, finetuning on CCNet could hurt performance.
+Temporal Datasets
+ +TEMPLAMA (cloze style queries where the answer changes with time) and DATESET (dataset generated through a series of templates and populated with random dates/durations).
+While Toolformer outperforms the baselines for both datasets, it relies on the Wikipedia search and question-answering tools (and not the calendar tool) for TEMPLAMA. On DATESET, it uses the calendar tool in the majority of cases.
+Language Modeling
+ +WikiText and a subset of 10,000 randomly selected documents from CCNet (not used during training of M).
+Training on $C^*$ does not increase perplexity (compared to training on $C$). In this experiment, the API calls are disabled during inference.
+Varying the size of the underlying model shows that the ability to use tools emerges only at around 775M parameters.
+Extending Toolformer to chain the use of tools and use tools interactively.
+In some cases, the use of tools is very sample-inefficient.
+The decision to use a tool does not account for the cost of using the tool.
+10 Feb 2023 » Toolformer - Language Models Can Teach Themselves to Use Tools
+29 Mar 2021 » Synthesized Policies for Transfer and Adaptation across Tasks and Environments
+22 Mar 2021 » Deep Neural Networks for YouTube Recommendations
+15 Mar 2021 » The Tail at Scale
+08 Mar 2021 » Practical Lessons from Predicting Clicks on Ads at Facebook
+01 Mar 2021 » Ad Click Prediction - a View from the Trenches
+22 Feb 2021 » Anatomy of Catastrophic Forgetting - Hidden Representations and Task Semantics
+15 Feb 2021 » When Do Curricula Work?
+08 Feb 2021 » Continual learning with hypernetworks
+01 Feb 2021 » Zero-shot Learning by Generating Task-specific Adapters
+25 Jan 2021 » HyperNetworks
+18 Jan 2021 » Energy-based Models for Continual Learning
+11 Jan 2021 » GPipe - Easy Scaling with Micro-Batch Pipeline Parallelism
+04 Jan 2021 » Compositional Explanations of Neurons
+21 Dec 2020 » Design patterns for container-based distributed systems
+14 Dec 2020 » Cassandra - a decentralized structured storage system
+07 Dec 2020 » CAP twelve years later - How the rules have changed
+30 Nov 2020 » Consistency Tradeoffs in Modern Distributed Database System Design
+23 Nov 2020 » Exploring Simple Siamese Representation Learning
+16 Nov 2020 » Data Management for Internet-Scale Single-Sign-On
+09 Nov 2020 » Searching for Build Debt - Experiences Managing Technical Debt at Google
+02 Nov 2020 » One Solution is Not All You Need - Few-Shot Extrapolation via Structured MaxEnt RL
+19 Oct 2020 » Learning Explanations That Are Hard To Vary
+12 Oct 2020 » Remembering for the Right Reasons - Explanations Reduce Catastrophic Forgetting
+28 Sep 2020 » A Foliated View of Transfer Learning
+21 Sep 2020 » Harvest, Yield, and Scalable Tolerant Systems
+14 Sep 2020 » MONet - Unsupervised Scene Decomposition and Representation
+07 Sep 2020 » Revisiting Fundamentals of Experience Replay
+31 Aug 2020 » Deep Reinforcement Learning and the Deadly Triad
+24 Aug 2020 » Alpha Net–Adaptation with Composition in Classifier Space
+14 Aug 2020 » Outrageously Large Neural Networks–The Sparsely-Gated Mixture-of-Experts Layer
+06 Aug 2020 » Gradient Surgery for Multi-Task Learning
+30 Jul 2020 » GradNorm–Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks
+23 Jul 2020 » TaskNorm–Rethinking Batch Normalization for Meta-Learning
+16 Jul 2020 » Averaging Weights leads to Wider Optima and Better Generalization
+09 Jul 2020 » Decentralized Reinforcement Learning – Global Decision-Making via Local Economic Transactions
+02 Jul 2020 » When to use parametric models in reinforcement learning?
+25 Jun 2020 » Network Randomization - A Simple Technique for Generalization in Deep Reinforcement Learning
+18 Jun 2020 » On the Difficulty of Warm-Starting Neural Network Training
+30 Apr 2020 » Supervised Contrastive Learning
+09 Apr 2020 » CURL - Contrastive Unsupervised Representations for Reinforcement Learning
+12 Mar 2020 » Competitive Training of Mixtures of Independent Deep Generative Models
+05 Mar 2020 » What Does Classifying More Than 10,000 Image Categories Tell Us?
+27 Feb 2020 » mixup - Beyond Empirical Risk Minimization
+20 Feb 2020 » ELECTRA - Pre-training Text Encoders as Discriminators Rather Than Generators
+13 Feb 2020 » Gradient based sample selection for online continual learning
+06 Feb 2020 » Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One
+30 Jan 2020 » Massively Multilingual Neural Machine Translation in the Wild - Findings and Challenges
+23 Jan 2020 » Observational Overfitting in Reinforcement Learning
+16 Jan 2020 » Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML
+09 Jan 2020 » Accurate, Large Minibatch SGD - Training ImageNet in 1 Hour
+02 Jan 2020 » Superposition of many models into one
+26 Dec 2019 » Towards a Unified Theory of State Abstraction for MDPs
+19 Dec 2019 » ALBERT - A Lite BERT for Self-supervised Learning of Language Representations
+12 Dec 2019 » Everything Happens for a Reason - Discovering the Purpose of Actions in Procedural Text
+05 Dec 2019 » Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
+28 Nov 2019 » Contrastive Learning of Structured World Models
+12 Sep 2019 » Gossip based Actor-Learner Architectures for Deep RL
+05 Sep 2019 » How to train your MAML
+29 Aug 2019 » PHYRE - A New Benchmark for Physical Reasoning
+22 Aug 2019 » Large Memory Layers with Product Keys
+15 Aug 2019 » Abductive Commonsense Reasoning
+08 Aug 2019 » Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models
+01 Aug 2019 » Assessing Generalization in Deep Reinforcement Learning
+25 Jul 2019 » Quantifying Generalization in Reinforcement Learning
+18 Jul 2019 » Set Transformer - A Framework for Attention-based Permutation-Invariant Neural Networks
+27 Jun 2019 » Measuring abstract reasoning in neural networks
+20 Jun 2019 » Hamiltonian Neural Networks
+13 Jun 2019 » Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations
+08 Jun 2019 » Meta-Reinforcement Learning of Structured Exploration Strategies
+01 Jun 2019 » Relational Reinforcement Learning
+21 May 2019 » Good-Enough Compositional Data Augmentation
+14 May 2019 » Multiple Model-Based Reinforcement Learning
+09 Apr 2019 » Towards a natural benchmark for continual learning
+02 Apr 2019 » Meta-Learning Update Rules for Unsupervised Representation Learning
+26 Mar 2019 » GNN Explainer - A Tool for Post-hoc Explanation of Graph Neural Networks
+16 Mar 2019 » To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks
+12 Mar 2019 » Model Primitive Hierarchical Lifelong Reinforcement Learning
+19 Feb 2019 » TuckER - Tensor Factorization for Knowledge Graph Completion
+05 Feb 2019 » Linguistic Knowledge as Memory for Recurrent Neural Networks
+29 Jan 2019 » Diversity is All You Need - Learning Skills without a Reward Function
+22 Jan 2019 » Modular meta-learning
+15 Jan 2019 » Hierarchical RL Using an Ensemble of Proprioceptive Periodic Policies
+08 Jan 2019 » Efficient Lifelong Learning with A-GEM
+02 Jan 2019 » Pre-training Graph Neural Networks with Kernels
+25 Dec 2018 » Smooth Loss Functions for Deep Top-k Classification
+18 Dec 2018 » Hindsight Experience Replay
+11 Dec 2018 » Representation Tradeoffs for Hyperbolic Embeddings
+01 Nov 2018 » Learned Optimizers that Scale and Generalize
+25 Oct 2018 » One-shot Learning with Memory-Augmented Neural Networks
+18 Oct 2018 » BabyAI - First Steps Towards Grounded Language Learning With a Human In the Loop
+11 Oct 2018 » Poincaré Embeddings for Learning Hierarchical Representations
+04 Oct 2018 » When Recurrent Models Don’t Need To Be Recurrent
+27 Sep 2018 » HoME - a Household Multimodal Environment
+12 Sep 2018 » Emergence of Grounded Compositional Language in Multi-Agent Populations
+21 Aug 2018 » A Semantic Loss Function for Deep Learning with Symbolic Knowledge
+16 Aug 2018 » Hierarchical Graph Representation Learning with Differentiable Pooling
+08 Aug 2018 » Imagination-Augmented Agents for Deep Reinforcement Learning
+19 Jul 2018 » Kronecker Recurrent Units
+11 Jul 2018 » Learning Independent Causal Mechanisms
+04 Jul 2018 » Memory-based Parameter Adaptation
+09 Jun 2018 » Born Again Neural Networks
+21 May 2018 » Net2Net-Accelerating Learning via Knowledge Transfer
+06 May 2018 » Learning to Count Objects in Natural Images for Visual Question Answering
+08 Apr 2018 » Neural Message Passing for Quantum Chemistry
+02 Apr 2018 » Unsupervised Learning by Predicting Noise
+25 Mar 2018 » The Lottery Ticket Hypothesis - Training Pruned Neural Networks
+18 Mar 2018 » Cyclical Learning Rates for Training Neural Networks
+11 Mar 2018 » Improving Information Extraction by Acquiring External Evidence with Reinforcement Learning
+05 Mar 2018 » An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks
+24 Feb 2018 » Learning an SAT Solver from Single-Bit Supervision
+17 Feb 2018 » Neural Relational Inference for Interacting Systems
+11 Feb 2018 » Stylistic Transfer in Natural Language Generation Systems Using Recurrent Neural Networks
+05 Feb 2018 » Get To The Point - Summarization with Pointer-Generator Networks
+29 Jan 2018 » StarSpace - Embed All The Things!
+22 Jan 2018 » Emotional Chatting Machine - Emotional Conversation Generation with Internal and External Memory
+14 Jan 2018 » Exploring Models and Data for Image Question Answering
+06 Jan 2018 » How transferable are features in deep neural networks
+31 Dec 2017 » Distilling the Knowledge in a Neural Network
+24 Dec 2017 » PTE - Predictive Text Embedding through Large-scale Heterogeneous Text Networks
+11 Dec 2017 » Revisiting Semi-Supervised Learning with Graph Embeddings
+28 Nov 2017 » Two-Stage Synthesis Networks for Transfer Learning in Machine Comprehension
+19 Nov 2017 » Higher-order organization of complex networks
+12 Nov 2017 » Network Motifs - Simple Building Blocks of Complex Networks
+05 Nov 2017 » Word Representations via Gaussian Embedding
+28 Oct 2017 » HARP - Hierarchical Representation Learning for Networks
+22 Oct 2017 » Swish - a Self-Gated Activation Function
+15 Oct 2017 » Reading Wikipedia to Answer Open-Domain Questions
+01 Oct 2017 » Task-Oriented Query Reformulation with Reinforcement Learning
+22 Sep 2017 » Refining Source Representations with Relation Networks for Neural Machine Translation
+27 Aug 2017 » Pointer Networks
+21 Aug 2017 » Learning to Compute Word Embeddings On the Fly
+07 Aug 2017 » R-NET - Machine Reading Comprehension with Self-matching Networks
+24 Jul 2017 » ReasoNet - Learning to Stop Reading in Machine Comprehension
+17 Jul 2017 » Principled Detection of Out-of-Distribution Examples in Neural Networks
+09 Jul 2017 » Ask Me Anything - Dynamic Memory Networks for Natural Language Processing
+01 Jul 2017 » One Model To Learn Them All
+26 Jun 2017 » Two/Too Simple Adaptations of Word2Vec for Syntax Problems
+17 Jun 2017 » A Decomposable Attention Model for Natural Language Inference
+03 Jun 2017 » A Fast and Accurate Dependency Parser using Neural Networks
+23 May 2017 » Neural Module Networks
+14 May 2017 » Making the V in VQA Matter - Elevating the Role of Image Understanding in Visual Question Answering
+07 May 2017 » Conditional Similarity Networks
+28 Apr 2017 » Simple Baseline for Visual Question Answering
+27 Apr 2017 » VQA-Visual Question Answering
+