diff --git a/README.md b/README.md index 45577da2..6fb7864d 100755 --- a/README.md +++ b/README.md @@ -4,6 +4,14 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho ## List of papers +* [CAP twelve years later - How the rules have changed](https://shagunsodhani.com/papers-I-read/CAP-twelve-years-later-How-the-rules-have-changed) +* [Consistency Tradeoffs in Modern Distributed Database System Design](https://shagunsodhani.com/papers-I-read/Consistency-Tradeoffs-in-Modern-Distributed-Database-System-Design) +* [Exploring Simple Siamese Representation Learning](https://shagunsodhani.com/papers-I-read/Exploring-Simple-Siamese-Representation-Learning) +* [Data Management for Internet-Scale Single-Sign-On](https://shagunsodhani.com/papers-I-read/Data-Management-for-Internet-Scale-Single-Sign-On) +* [Searching for Build Debt - Experiences Managing Technical Debt at Google](https://shagunsodhani.com/papers-I-read/Searching-for-Build-Debt-Experiences-Managing-Technical-Debt-at-Google) +* [One Solution is Not All You Need - Few-Shot Extrapolation via Structured MaxEnt RL](https://shagunsodhani.com/papers-I-read/One-Solution-is-Not-All-You-Need-Few-Shot-Extrapolation-via-Structured-MaxEnt-RL) +* [Learning Explanations That Are Hard To Vary](https://shagunsodhani.com/papers-I-read/Learning-Explanations-That-Are-Hard-To-Vary) +* [Remembering for the Right Reasons - Explanations Reduce Catastrophic Forgetting](https://shagunsodhani.com/papers-I-read/Remembering-for-the-Right-Reasons-Explanations-Reduce-Catastrophic-Forgetting) * [A Foliated View of Transfer Learning](https://shagunsodhani.com/papers-I-read/A-Foliated-View-of-Transfer-Learning) * [Harvest, Yield, and Scalable Tolerant Systems](https://shagunsodhani.com/papers-I-read/Harvest,-Yield,-and-Scalable-Tolerant-Systems) * [MONet - Unsupervised Scene Decomposition and Representation](https://shagunsodhani.com/papers-I-read/MONet-Unsupervised-Scene-Decomposition-and-Representation) diff 
--git a/site/_posts/2020-10-12-Remembering for the Right Reasons - Explanations Reduce Catastrophic Forgetting.md b/site/_posts/2020-10-12-Remembering for the Right Reasons - Explanations Reduce Catastrophic Forgetting.md new file mode 100755 index 00000000..4337cb6a --- /dev/null +++ b/site/_posts/2020-10-12-Remembering for the Right Reasons - Explanations Reduce Catastrophic Forgetting.md @@ -0,0 +1,56 @@ +--- +layout: post +title: Remembering for the Right Reasons - Explanations Reduce Catastrophic Forgetting +comments: True +excerpt: +tags: ['2020', 'Catastrophic Forgetting', 'Continual Learning', 'Lifelong Learning', 'Replay Buffer', AI, CL, LL] + +--- + +## Introduction + +* The paper hypothesizes that catastrophic forgetting can happen if the model can not rely on "reasoning" used for an old datapoint. If that is the case, catastrophic forgetting may be alleviated when the model "remembers" why it made a prediction previously. +* The paper presents a simple instantiation of this hypothesis, in the form of a technique called Remembering for the Right Reasons (RRR). +* The idea is to store model explanations, along with previous examples in the replay buffer. During replay, an additional *explanation loss* is used, along with the regular replay loss. +* [Link to the paper](https://arxiv.org/abs/2010.01528) +* [Link to the code](https://github.com/SaynaEbrahimi/Remembering-for-the-Right-Reasons) + +## Setup + +* The model is trained over a sequence of data distributions in the class-incremental learning setup. A single-head architecture is used so that the task ID is not required during inference. +* Along with the standard replay buffer ($$M^{rep}$$) for the raw input examples (from different tasks), another replay buffer ($$M^{RRR}$$) is maintained for storing the "explanations" (in the form of saliency maps), corresponding to examples in $$M^{rep}$$. 
+* RRR is implemented as an L1 loss on the error between the saliency map generated after training on the current task and the saliency map in $$M^{RRR}$$. +* Saliency maps need to be generated while the model is training. This requirement rules out black-box saliency methods, which can be used only after training. +* The gradient-based white-box explainability techniques that are used include: + * Vanilla backpropagation - Perform a forward pass through the model and take the gradient of the given output class with respect to the input. + * Backpropagation with SmoothGrad - Saliency maps generated using Vanilla backpropagation can be visually noisy. These maps can be improved by adding pixel-wise Gaussian noise to *n* copies of the image and averaging the resulting gradients. The paper used *n=40*. + * Gradient-weighted Class Activation Mapping (Grad-CAM) - Uses gradients to determine the importance of feature map activations on a given prediction. +* RRR can be easily used with memory and regularization based approaches. +* The paper combined RRR with the following standard Class Incremental Learning (CIL) models: + * [iTAML : An incremental task-agnostic meta-learning approach](https://arxiv.org/abs/2003.11652) + * [End-to-end incremental learning (EEIL)](https://arxiv.org/abs/1807.09536) + * [Large scale incremental learning (BiC)](https://arxiv.org/abs/1905.13260) + * [TOpology-Preserving knowledge InCrementer (TOPIC)](https://arxiv.org/abs/2004.10956) + * [iCaRL: Incremental Classifier and Representation Learning](https://arxiv.org/abs/1611.07725) + * [Elastic Weight Consolidation](https://arxiv.org/abs/1612.00796) + * [Learning without forgetting](https://arxiv.org/abs/1606.09282) + +## Experiments + +### Few-Shot Class Incremental Learning + +* C-way K-shot class incremental learning with C classes and K training samples per class and b base classes to learn as the first task.
+* Caltech-UCSD Birds dataset with 100 base classes and remaining 100 classes divided into ten tasks, with three samples per class. The test set is not changed. +* In terms of saliency maps, Grad-CAM is better than Vanilla Backpropagation, which in turn is comparable to SmoothGrad. The same trend is seen in terms of memory overhead, with Grad-CAM having the least memory overhead. +* Adding the RRR loss improves the performance of all the baselines. + +### Standard Class Incremental Learning + +* CIFAR100 and ImageNet100 with a memory budget of 2000 samples. +* Adding the RRR loss improves all the baselines' performance, and the gains for ImageNet100 are more significant than the gains for CIFAR100. + +### How often does the model remember its decision for the right reason? + +* The paper uses the Pointing Game (PG) experiment, which uses the ground truth image segmentation to define the true object region. +* If the maximum attention location (in the predicted saliency map) falls inside the object, it is considered a *hit*, else a *miss*. A *hit* on a previous example is considered a proxy for the model remembering its decision for the right reason. +* The precision and recall are reported for the *hit* metric. Using RRR increases both precision (i.e., the model less often makes the correct decision without looking at the right evidence) and recall (i.e., the model less frequently makes an incorrect decision despite looking at the proper evidence).
\ No newline at end of file diff --git a/site/_posts/2020-10-19-Learning Explanations That Are Hard To Vary.md b/site/_posts/2020-10-19-Learning Explanations That Are Hard To Vary.md new file mode 100755 index 00000000..eb4787e8 --- /dev/null +++ b/site/_posts/2020-10-19-Learning Explanations That Are Hard To Vary.md @@ -0,0 +1,72 @@ +--- +layout: post +title: Learning Explanations That Are Hard To Vary +comments: True +excerpt: +tags: ['2020', AI, Invariance] + +--- + +## Introduction + +* The paper builds on the principle "good explanations are hard to vary" to propose that *invariant mechanisms* can be identified by finding explanations (say model parameters) that are hard to vary across examples. +* [Link to the paper](https://arxiv.org/abs/2009.00329) +* [Link to the code](https://github.com/gibipara92/learning-explanations-hard-to-vary) + +## Setup + +* Collection of *d* different datasets (from different environments). Each dataset is a collection of input-target tuples. +* Objective is to learn a function *f* (also called *mechanism*) to map the input to the target (for all the environments). +* The standard approach is to pool the loss for examples corresponding to the different environments and perform gradient updates on this average-pooled loss. +* In this standard gradient-based setup, the model may not learn invariances due to the following reasons: + * Model learned the spurious features first, and now the training loss is too small. + * The pooled loss is generally computed by summing (or averaging) the loss corresponding to individual examples. Thus the gradient for each example is calculated independently. Each sample can be thought of as a dataset of size 1, for which all the features are relevant. + * Gradient descent with averaging (of gradients across the environments) greedily maximizes for the learning speed and not invariance. 
+* Performing arithmetic mean can be seen as performing an OR operation (i.e., the sum can be high if any one of the constituents is high), whereas performing geometric mean can be seen as performing an AND operation (i.e., the product can be high only if all the constituents are high). + +### Invariant Learning Consistency (ILC) + +* Given an algorithm $$A$$, let $$\theta_{A}^{*}$$ denote the set of convergence points of $$A$$ when trained on all the environments. +* Each convergence point is associated with a consistency score. +* Intuitively, given a convergence point and an environment *e*, find the set of parameters equivalent to the convergence point (in terms of loss) with respect to *e*. Let's call this set *S*. +* Evaluate the points in this set for all the remaining environments. For the given convergence point, an environment *e'* is consistent with *e* if the maximum difference in the loss for the two environments is small, for all points belonging to *S*. +* This idea is used to define the invariant learning consistency score for algorithm $$A$$, which measures the expected consistency of the converged points (on the pooled data) across all the environments. +* The paper shows that the converged points' consistency is linked to the Hessians' geometric mean and that for the convex quadratic case, using the elementwise geometric mean of gradients improves consistency. +* However, there are some practical challenges: + * Geometric mean is defined only when all signs are consistent. This issue can potentially be handled by treating different signs as 0. + * There is very little flexibility in "partial" agreement, and even a single zero gradient component can stop optimization for that component. This can probably be handled by not masking if many environments have a gradient for that component. + * The geometric mean needs to be computed in the log-domain (for numerical stability), but that can be computationally more expensive.
+ * When using adaptive optimizers like Adam, the exact magnitude of geometric mean will be ignored because of rescaling for the local curvature adaptation. +* Some of these challenges can be handled using average gradients when the geometric mean would be 0 and masking out components based on the sign. + +### AND-mask + +* The ideas from the previous section can be used to develop a practical algorithm called AND-mask. +* Zero-out gradients that have inconsistent signs across some threshold number (hyper-parameter) of environments. +* In the presence of purely random gradient patterns, the AND-mask decreases the signals' strength exponentially fast. + +## Experiments + +### Synthetic Memorization Dataset + +* This is a binary classification task with two kinds of features: (i) "meaningful" features that are shared across environments but harder for the model to learn and (ii) "shortcut" features that are easy to learn but not shared across environments. +* While the dataset may look simple, it is difficult to find the invariant mechanism because the "shortcut" features allow for a simple, linear decision boundary with a large margin that is fast to learn, has perfect accuracy, is robust to input noise, and has no iid generalization gap. +* Baselines: + * MLPs trained with regularizers like dropout, L1, L2, and batch norm. + * Domain Adversarial Neural Networks (DANN) + * Invariant Risk Minimization (IRM) +* In terms of results, AND-mask with L1/L2 regularizers gives the best results. +* Empirically, the paper shows that the signal from the "meaningful" features is present when the gradients are averaged, but their magnitude is much smaller than the signal from the "shortcut" features. + +### Experiments on CIFAR-10 + +* A ResNet model is trained on the CIFAR-10 dataset with random labels, with and without the AND-mask. +* The model with the AND-mask did not memorize the data, whereas the model without the AND-mask did.
As a sanity check, the paper ensured that both the models generalize well when trained with the original labels. +* Note that for this experiment, every example was treated as having come from its own environment. + +### Behavioral Cloning on CoinRun + +* Train an expert policy using PPO for 400M steps on the full distribution of levels. +* Generate a dataset of state-action pairs. Training data consists of 1000 states from each of the 64 levels, while the test data comes from 2000 levels. +* A ResNet18 model is used as an imitation learning policy. +* The exact implementation of the AND-mask is a little more involved, but the key takeaway is that the model trained with the AND-mask identifies invariant mechanisms across different levels. \ No newline at end of file diff --git a/site/_posts/2020-11-02-One Solution is Not All You Need: Few-Shot Extrapolation via Structured MaxEnt RL.md b/site/_posts/2020-11-02-One Solution is Not All You Need: Few-Shot Extrapolation via Structured MaxEnt RL.md new file mode 100755 index 00000000..9272cf44 --- /dev/null +++ b/site/_posts/2020-11-02-One Solution is Not All You Need: Few-Shot Extrapolation via Structured MaxEnt RL.md @@ -0,0 +1,79 @@ +--- +layout: post +title: One Solution is Not All You Need - Few-Shot Extrapolation via Structured MaxEnt RL +comments: True +excerpt: +tags: ['2020', 'Deep Reinforcement Learning', 'Latent Variable', 'NeurIPS 2020', 'Reinforcement Learning', AI, DRL, Generalization, NeurIPS, RL] + + +--- + +## Introduction + +* Key idea: Practicing and remembering diverse solutions to a task can lead to robustness to that task's variations. + +* The paper proposes a framework to implement this idea - train multiple policies such that they are *collectively* robust to a new distribution over environments while using a single training environment. + +* [Link to the paper](https://arxiv.org/abs/2010.14484) + +## Setup + +* During training, the agent has access to only one MDP.
+ +* During the evaluation, the agent encounters a new MDP which has the same state and action space but may have a different reward and transition function. + +* The agent is allowed some interactions (say *k*) with the test MDP and is then evaluated on the test MDP. The setup is referred to as *few-shot robustness*. + +## Structured Maximum Entropy Reinforcement Learning (SMERL) + +* Represent a set of policies using a latent variable policy (i.e., a policy conditioned on a latent variable *z*). + +* This has two benefits: (i) Multiple policies can be represented by the same object, and (ii) diverse behaviors can be learned by encouraging the trajectories corresponding to different *z* to be different, while still being able to solve the task. + +* A diversity-inducing objective is used to encourage the agent to learn different trajectories for different *z*. + +* Specifically, the mutual information between *p(Z)* and marginal trajectory distribution for the latent variable policy is maximized, subject to the constraint that each policy achieves close to optimal returns in the train MDP. + +* The mutual information between *p(Z)* and marginal trajectory distribution for the latent variable policy is lower bounded by the sum of mutual information terms over individual states (appearing in the trajectory). + +* An unsupervised reward function is defined using the mutual information between states and latent variables. + +* $$r(s, a) = \log q_{\phi}(z\|s) - \log p(z)$$ where $$q_{\phi}$$ is a learned discriminator. + +* This unsupervised reward is optimized for only when the policy achieves close to an optimal return, i.e., the environment return is close to the optimal return. Otherwise, the agent optimizes only for the environment return. + +### Implementation + +* SMERL is implemented using SAC with a latent variable maximum entropy policy. + +* The set of latent variables is a fixed discrete set $$Z$$ and $$p(z)$$ is set to be a uniform distribution over this set.
+ +* At the start of an episode, a $$z$$ is sampled and used throughout the episode. + +* Discriminator $$q_{\phi}(z\|s)$$ is trained to infer $$z$$ from the visited states. + +* A baseline SAC agent is trained beforehand to evaluate if the current training policy achieves close to optimal environment return. + +* During the evaluation, the policy corresponding to each latent variable is executed in the test MDP, and the policy with the maximum return is returned. + +## Theoretical Analysis + +* Given an MDP $$M$$ and $$\epsilon>0$$, the MDP robustness set is defined as the set of all MDPs $$M'$$ where the optimal policy of $$M'$$ produces the same trajectory distribution in $$M'$$ as in $$M$$. Moreover, on the training MDP $$M$$, the optimal policies (corresponding to $$M$$ and $$M'$$) obtain similar returns. + +* The paper shows that SMERL generalizes to MDPs belonging to the robustness set. + +* It also provides a simplified view of the optimization objective and shows how it naturally leads to a trajectory-centric mutual information objective. + +## Experiments + +* Environments + + * 2D navigation environments with point mass. + + * Mujoco Environments: HalfCheetah-Goal, Walker2d-Velocity, Hopper-Velocity. + +* On the 2D navigation environment, the paper shows that SMERL learns to use different trajectories to reach the goal. + +* On the Mujoco setup, the evaluation shows that SMERL generally outperforms the best-performing baseline or is close to the best-performing baseline on different tasks. + +* Generally, higher train performance does not correlate with higher test performance, and there is no single policy that performs the best across all the tasks. Thus, it should be beneficial to learn multiple diverse policies that can be selected from during testing.
\ No newline at end of file diff --git a/site/_posts/2020-11-09-Searching for Build Debt - Experiences Managing Technical Debt at Google.md b/site/_posts/2020-11-09-Searching for Build Debt - Experiences Managing Technical Debt at Google.md new file mode 100755 index 00000000..8722b067 --- /dev/null +++ b/site/_posts/2020-11-09-Searching for Build Debt - Experiences Managing Technical Debt at Google.md @@ -0,0 +1,89 @@ +--- +layout: post +title: Searching for Build Debt - Experiences Managing Technical Debt at Google +comments: True +excerpt: +tags: ['2012', 'Build System', 'Software Engineering', 'Technical Debt', Engineering, IEEE, Software, Systems] + + +--- + +## Introduction + +* The paper describes the efforts to control and repay the technical debt in the build system at Google (called the Build Debt). + +* Guiding Principles: + + * Automate techniques to analyze and fix issues that contribute to technical debt. + + * Make it easier to do the right thing as developers can incur technical debt unknowingly. + + * Make it hard to do the wrong thing, e.g., by building stricter checks into the build process. + +* Note that some of the metrics and design decisions may be outdated now (the paper was written in 2012). However, the core message is still relevant. + +* [Link to the paper](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37755.pdf) + +## Google's Build System Debt + +* BUILD files encapsulate the specifications for building software. + +* Generally, these files are maintained manually, and the dependencies may not be up-to-date over time. + +* In extreme cases, some of the build targets are not built for months. Such targets are called zombie targets. + +* Originally, any project could depend on any other project's internal details, thus creating (sometimes unwanted) couplings. 
+ +* If the lower-level project did not intend to expose some internal details, the unwanted couplings introduce technical debt and make it harder to modify the lower-level project. + +* One form of technical debt is the visibility debt or the cost of back-fitting visibility rules onto the existing build specifications to re-establish the appropriate encapsulations. + +* Another example of technical debt is dead code that can confuse the developers looking for useful APIs. + +## Dependency Debt + +* *Over-declared* or *underutilized* dependencies can slow the build and testing of systems. + +* *Under-declared* dependencies can make the build process brittle and make it difficult to remove *over-declared* dependencies. + +* Potential solutions for *over-declared* dependencies include: + + * Setting aside some dedicated time for fixing build rules. But this approach is not automated, and potential breakages make it harder for developers to do the right thing. + + * Automatically add all the *under-declared* dependencies to the BUILD files. The system can raise an error if a direct dependency is missing, making it harder to do the wrong thing. + + * Automation can be applied for finding/reporting the over-declared dependencies as well. + +* Potential solutions for *underutilized* dependencies include: + + * While it is challenging to automate fixing *underutilized* dependencies, automating the discovery of such dependencies is still useful. + + * Highlighting dependencies with high cost and low removal effort could incentivize developers to clean up their projects. + +## Zombie Targets + +* Zombie targets can be identified by querying the results of build and test runs. + +* A target is marked as "dead" if the attempts to build it have failed for at least 90 days. Until then, build errors are considered to be transient.
+ +* A zombie target can be eliminated by deleting its definition from the BUILD file and deleting the source files that are reachable only via the zombie target. + +## Visibility Debt + +* Originally, the default visibility of all the targets was public, leading to unintended dependencies. + +* The visibility of all the existing builds was set to *legacy_public*, and the default visibility was changed to private. + +* This encouraged developers to explicitly consider if they wanted other projects to depend on their project. + +## Dead Flags + +* Google developed its command-line parsing utilities and defined a set of recognized command-line flags for libraries and binaries. + +* Over time, the number of flags grew to half a million, and many of these flags are not useful anymore (i.e., dead). + +* These dead flags can make it hard to understand and refactor code. + +* Existing flags are analyzed to find the ones that have always been set to the same value; these flags are replaced by that value, clearing about 150 thousand flags. + +* Removing dead flags also helps to clean up dead/unreachable code. \ No newline at end of file diff --git a/site/_posts/2020-11-16-Data Management for Internet-Scale Single-Sign-On.md b/site/_posts/2020-11-16-Data Management for Internet-Scale Single-Sign-On.md new file mode 100755 index 00000000..ac734028 --- /dev/null +++ b/site/_posts/2020-11-16-Data Management for Internet-Scale Single-Sign-On.md @@ -0,0 +1,88 @@ +--- +layout: post +title: Data Management for Internet-Scale Single-Sign-On +comments: True +excerpt: +tags: ['2006', 'Big Data', 'Distributed Systems', 'Key Value', 'Software Engineering', Data, Database, DBMS, Engineering, Scale, Software, SSO, Systems, USENIX] + + +--- + + + +## Introduction + +* The paper describes the architecture of an erstwhile single-sign-on (SSO) service used by Google, called Google Accounts (2006). + +* Note that some of the metrics and design decisions may be outdated now (the paper was written in 2006).
However, the core message is still relevant. + +* [Link to the paper](https://www.usenix.org/legacy/event/worlds06/tech/prelim_papers/perl/perl.pdf) + +## Operational Constraints + +* SSO's availability affects the availability of all applications that require user sign-in. + +* Generally, systems can achieve high availability by sacrificing consistency, but given the nature of SSO (matching username/passwords), providing an inconsistent view is not a good option, and single-copy consistency is a usability requirement. + +## Berkeley DB + +* Berkeley DB is an embedded, high-performance, scalable, transactional storage system for key-value data and provides both keyed and sequential lookup. + +* It provides a primary copy replication model with a single writer (called master) and multiple read-only replicas. + +* All writes are sent to the master, which first applies the changes and then propagates them to the replicas. + +* The master and the replicas have identical logs, and in case of master failure, a new master is elected from the replicas. + +* Some synchronization may be needed between the replicas if, for example, the master dies in the middle of a transaction. + +## SSO Architecture + +* SSO service maps usernames to user account data and services to service-specific data. + +* The SSO database is partitioned into shards, where each shard is a replicated Berkeley DB (having 5 to 15 replicas). + +* Each replica stores the data in a B+-link tree data structure. + +* Consistent reads must go to the master, while non-master replicas can serve "stale" reads. + +* In the case of larger replication groups (say 15 replicas), only a subset of replicas can become master ("electable replicas"). + +* In general, replicas are spread geographically to handle machine-failure, network-failure, and data center-failure. + +* Replicas in a shard are kept close to reduce the communication latency, which affects the time to commit a write operation or elect a new master.
+ +* Some of the shards implement an ID-map, i.e., a map of username to userid and userid to shards. + +## Database Integration + +* Berkeley DB leaves decisions regarding quorums, leases, etc., up to the application. + +### Quorums + +* SSO chooses a quorum protocol that guarantees that updates are never lost. + +* For the write queries, the master waits for a positive acknowledgment from a majority of the replicas, including itself, before marking the query as completed. + +* When selecting a new leader, SSO requires a majority of replicas to agree. Moreover, Berkeley DB elections always choose a replica with the latest log entry, thus guaranteeing that the new master's log will include all the previous master's updates. + +### Leases + +* The master holds a *master lease* when responding to read queries and refreshes this lease periodically by communicating with a majority of replicas. + +* The lease guarantees that the master is not returning stale data if a partition or failure causes the master to lose its mastership, i.e., holding the lease guarantees that the master is still the master. + +* Moreover, elections cannot be completed within the lease timeout interval. + +### Replica Group Membership + + +* SSO maintains a replica configuration containing the logical (DNS) name and IP address of each replica. + +* In case of any changes to the configuration, the changes are specified in a file that the master reads periodically. + +* If the configuration changes, the master initiates a configuration change and updates the database. + +* Non-master replicas can get the new configuration from the database. + +* A new replica or a replica that lost state (say due to a failure) starts as a non-voting replica and cannot participate in an election till it has caught up with the master as of the time the replica joined (again).
diff --git a/site/_posts/2020-11-23-Exploring Simple Siamese Representation Learning.md b/site/_posts/2020-11-23-Exploring Simple Siamese Representation Learning.md new file mode 100755 index 00000000..cbcff11d --- /dev/null +++ b/site/_posts/2020-11-23-Exploring Simple Siamese Representation Learning.md @@ -0,0 +1,75 @@ +--- +layout: post +title: Exploring Simple Siamese Representation Learning +comments: True +excerpt: +tags: ['2020', 'Self Supervised', AI, CV, ImageNet, Siamese, SSL, Unsupervised] + + +--- + +## Introduction + +* The paper shows that Siamese networks can be used for unsupervised learning with images without needing techniques like negative sample pairs, large batch training, or momentum encoders. The training mechanism is referred to as the SimSiam method. + +* [Link to the paper](https://arxiv.org/abs/2011.10566) + + +## Method + +* Given an input image *x*, create two augmented views *x1* and *x2*. + +* These views are processed by an encoder network *f*. + +* One of the views (say *x1*) is processed by the encoder *f* as well as a predictor MLP *h* to obtain a projection *p1*, i.e., *p1 = h(f(x1))*. + +* The second view (*x2*) is processed only by the encoder *f* to obtain an encoding *z2*, i.e., *z2 = f(x2)*. + +* Negative cosine similarity is minimized between *p1* and *z2* with the catch that the resulting gradients are not used to update the encoder via *z2*, i.e., Loss = *D(p1, stopgrad(z2))* where *D* is the negative cosine similarity and *stopgrad* is an operation that stops the flow of gradients. + +* In practice, both *p1, z2* and *p2, z1* pairs are used for computing the loss, i.e., Loss = *0.5 \* (D(p1, stopgrad(z2)) + D(p2, stopgrad(z1)))*. + +## Implementation Details + +* Encoder uses batch norm in all the layers (including output) while projection MLP uses batch norm only in the hidden layers. + +* SGD optimizer with learning rate as *0.05 \* batchsize / 256*, cosine learning rate decay schedule and SGD momentum = 0.9.
+ +* Unsupervised pretraining on the ImageNet dataset followed by training a supervised linear classifier on the frozen representations. + +## Results + +* Stop-gradient operation is necessary to avoid a degenerate solution. Without stop-gradient, the model maps all inputs to a constant *z*. + +* If the projection layer is removed, the method does not work (because of the loss's symmetric nature). If the loss is also made asymmetric, the method still does not work without the projection layer. However, asymmetric loss + projection layer works. + +* Keeping the projection layer fixed (i.e., not updating during training) avoids collapse but leads to poor validation performance. + +* Training the projection layer with a constant learning rate works better in practice, likely because the projection layer needs to keep adapting before the encoder layer is sufficiently trained. + +* The method works well across different batch sizes. + +* Removing batch norm layers from all the layers in all the networks does not lead to collapse, though the model's performance degrades on the validation dataset. Adding batch norm to the hidden layers alone is sufficient. + +* Adding batch norm to the encoder's output further improves the performance but adding batch norm to all the layers of all the networks makes the training unstable, with the loss oscillating. + +* Overall, while batch norm helps to improve performance, it is not sufficient to avoid collapse. + +* The setup does not collapse when the cross-entropy loss replaces the cosine loss. + + +## What is SimSiam solving? + +* Given that the stop-gradient operation seems to be the critical ingredient for avoiding collapse, the paper hypothesizes that SimSiam is solving a different optimization problem. + +* The hypothesis is that SimSiam is implementing an Expectation-Maximisation (EM) algorithm with two sets of variables and two underlying sub-problems. + +* The paper performs several experiments to test this hypothesis. 
For example, they consider *k* SGD steps for the first problem before performing an update for the second problem, showing that the alternating optimization is a valid formulation, of which SimSiam is a particular case. + +## Comparison to other methods + +* SimSiam achieves the highest accuracy among SimCLR, MoCo, BYOL, and SwAV for training under 100 epochs. However, it lags behind other methods when trained longer. + +* SimSiam's representations are transferable beyond the ImageNet tasks. + +* Adding projection layer and stop-gradient operator to SimCLR does not improve its performance. \ No newline at end of file diff --git a/site/_posts/2020-11-30-Consistency Tradeoffs in Modern Distributed Database System Design.md b/site/_posts/2020-11-30-Consistency Tradeoffs in Modern Distributed Database System Design.md new file mode 100755 index 00000000..b52a820a --- /dev/null +++ b/site/_posts/2020-11-30-Consistency Tradeoffs in Modern Distributed Database System Design.md @@ -0,0 +1,74 @@ +--- +layout: post +title: Consistency Tradeoffs in Modern Distributed Database System Design +comments: True +excerpt: +tags: ['2012', 'Big Data', 'Distributed Systems', 'Software Engineering', CAP, Data, Database, DBMS, Engineering, IEEE, Latency, Scale, Software, Systems] + +--- + +## Introduction + +* CAP theorem has been influential in the design decisions for distributed databases. + +* However, designers incorrectly assume that the CAP theorem "always" imposes restrictions in terms of the tradeoff between availability and consistency. In contrast, the tradeoff is applicable only in the case of partitions. + +* CAP theorem led to the development of highly available systems with reduced consistency models (and reduced ACID guarantees). + +* Another tradeoff - between latency and consistency - has also been influential for database design. + +* The paper unifies CAP and latency-consistency tradeoffs into a single formulation called PACELC. 
+ +* Note that some of the observations, especially ones about the databases, may be outdated now (the paper was written in 2012). However, the core message is still relevant. + +* [Link to the paper](https://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf) + +## Latency-Consistency Tradeoff + +* Low latency (or high availability) means that the system must replicate data. + +* In case of an update query, three possibilities arise: + + * The system can choose to send data updates to all the replicas at once. This leads to two possibilities: + + * A replica can receive the update queries in an arbitrary order, thus breaking consistency with other replicas. + + * Alternatively, the replicas could use some protocol to agree on the order of updates. However, this can introduce latency. + + * The update queries can be first sent to a master replica. + + * The master replica can apply the updates and send them to the other replicas using one of the following strategies: + + * Synchronous replication where the master waits for all the updates to be applied to the other replica(s). However, this approach introduces latency. + + * Asynchronous replication where the master treats the update as complete before it has been applied to the other replicas. In this case, the latency-consistency tradeoff depends on how read queries are handled: + + * The system can send all read queries to the master. In this case, there are no consistency issues, but additional latency is introduced because all the read queries go to the same replica, thus potentially overloading it. + + * Alternatively, the read query can be served from any replica. While this improves read latency, the results can be inconsistent now. + + * Use a mix of Synchronous and Asynchronous replication - i.e., the updates are sent to some of the replicas Synchronously and to the others Asynchronously.
In this case, the latency-consistency tradeoff depends on how read queries are handled: + + * If the read is routed to at least one replica that has been Synchronously updated, the consistency can be preserved, with additional latency for discovering the updated replica, etc. + + * If the read query cannot be routed to an updated replica (maybe because none of the replicas is updated), then either latency suffers or an inconsistent read can be performed. + + * The update query is first sent to an arbitrary replica. + + * This is the same as the previous case, with the query going to an arbitrary replica instead of the master replica, and suffers from the same latency issues as the last case. + +* In a nutshell, the tradeoff between latency and consistency is always present, irrespective of network failure. + +* This contrasts with the CAP theorem, which imposes the tradeoff between availability and consistency only in the case of a network partition. + +## PACELC + +* If there is a partition (P), how does the system trade off availability (A) and consistency (C); else (E), when the system is running without failures, how does the system trade off latency (L) and consistency (C)? + +* The latency-consistency tradeoff (ELC) is relevant only when the data is replicated. + +* Default versions of Dynamo, Cassandra, and Riak were PA/EL systems, i.e., if a partition occurs, availability is prioritized. In the absence of partition, lower latency is prioritized. + +* Fully ACID systems (VoltDB, H-Store, and Megastore) and others like BigTable and HBase are PC/EC, i.e., they prioritize consistency and give up availability and latency. + +* MongoDB can be classified as a PA/EC system, while PNUTS is a PC/EL system.
\ No newline at end of file diff --git a/site/_posts/2020-12-07-CAP twelve years later - How the rules have changed.md b/site/_posts/2020-12-07-CAP twelve years later - How the rules have changed.md new file mode 100755 index 00000000..fa19860d --- /dev/null +++ b/site/_posts/2020-12-07-CAP twelve years later - How the rules have changed.md @@ -0,0 +1,95 @@ +--- +layout: post +title: CAP twelve years later - How the rules have changed +comments: True +excerpt: +tags: ['2012', 'Big Data', 'Distributed Systems', ACID, BASE, CAP, Database, DBMS, Engineering, IEEE, Latency, Scale, Systems] + +--- + +## Introduction + +* The CAP theorem states that any system sharing data over the network can have at most two (out of three) desirable properties: + + * consistency (C), i.e., a single, up-to-date copy of the data; + + * high availability (A) of that data (for updates); and + + * tolerance to network partitions (P). + +* This "2 of 3" formulation is misleading as it oversimplifies the interplay between properties. + +* [Link to the paper](https://ieeexplore.ieee.org/abstract/document/6133253) + +## ACID vs. BASE + +* ACID is a design philosophy that focuses on consistency as reflected in the traditional relational databases. + +* The four properties in ACID are: + + * Atomicity (A), i.e., the operations are atomic, and either the entire operation succeeds or none of it succeeds. + + * Consistency (C), i.e., a transaction preserves all the database rules. Note that the consistency in CAP is a subset of consistency in ACID. + + * Isolation (I), i.e., transactions occur in isolation and do not affect each other. + + * Durability (D), i.e., the transactions are durable irrespective of system failure. + + +* BASE is an alternate design philosophy that focuses on availability as reflected in the NoSQL databases. + +* The three properties in BASE are: + + * Basic Availability (BA), i.e., the database appears to work most of the time.
+ + * Soft state (S), i.e., the system's state can change over time as it becomes eventually consistent. + + * Eventual consistency (E), i.e., the system will eventually become consistent over time. + +## CAP confusion + +* Generally, partition tolerance is seen as a must-have, thus reducing the choice to be between availability and consistency. + +* This view is somewhat misleading because the choice between C, A, and P is not binary but granular. + +* The choice between C and A can occur at various granularity levels, and different components (of a larger system) can prioritize different aspects. + +* Similarly, the CAP theorem generally ignores latency even though it is closely related to partitions. For example, failing to achieve consistency within a time bound (i.e., latency) implies a partition. + +* In general, there is no global notion of partition - some subset of nodes may experience a partition, and others may not. + +* Once a partition is detected, the system can then choose between C and A. + +## Managing Partitions + +* Three-step process for managing partitions: + + * Detect the start of a partition. + + * Enter an explicit partition mode that may limit some operations. + + * Possible strategies: + + * Reduce availability by limiting some operations. + + * Record extra information that can be used during partition recovery. + + * The strategy depends on the invariants that the system should maintain. + + * For example, if the invariant is that the keys (in a table) should be unique, the system could allow duplicate keys for some time and perform a de-duplication step during partition recovery. + + * A contrasting example is a monetary transaction (e.g., charging a credit card). In such cases, the system could disable the operation and record it to be performed later. Sometimes this "unavailability" is not visible to the user.
+ + * History of operations (over replicas across different partitions) can be tracked using version vectors of the form (node, logical time). The system can easily recreate the order in which they were executed (or mark them as being concurrent). + + * Initiate partition recovery when communication is restored and make the state across the partitions consistent. + + * One common approach is to revert to the state when the partition was detected and apply the operations consistently across all the replicas. + + * This may require some extra effort to merge conflicts. + + * One workaround can be to constrain the use of certain operations so that the system does not encounter merge conflicts during recovery. + + * Sometimes, certain invariants may be violated when the system is in the partition mode and needs to be fixed during recovery. + + * The key takeaway is that when partitions exist, the choice between availability and consistency is not binary, and both can be optimized for. \ No newline at end of file diff --git a/site/_site b/site/_site index c0e9649e..7fe877f6 160000 --- a/site/_site +++ b/site/_site @@ -1 +1 @@ -Subproject commit c0e9649eda6b04c9c4d68f40621d93bac32fa3a0 +Subproject commit 7fe877f6f12aaae7463431cefb63e1492d4cd555