diff --git a/site/_posts/2021-03-29-Synthesized Policies for Transfer and Adaptation across Tasks and Environments.md b/site/_posts/2021-03-29-Synthesized Policies for Transfer and Adaptation across Tasks and Environments.md
new file mode 100755
index 00000000..29e2c89c
--- /dev/null
+++ b/site/_posts/2021-03-29-Synthesized Policies for Transfer and Adaptation across Tasks and Environments.md
@@ -0,0 +1,113 @@
+---
+layout: post
+title: Synthesized Policies for Transfer and Adaptation across Tasks and Environments
+comments: True
+excerpt:
+tags: ['2018', 'Transfer Learning', 'Inverse Reinforcement Learning', 'Reinforcement Learning', 'NeurIPS 2018', AI, Compositionality, Generalization, IRL, NeurIPS, RL]
+---
+
+## Introduction
+
+* The paper studies transfer learning in RL, focusing on simultaneous transfer across both tasks and environments.
+
+* The key idea is to learn task and environment embeddings and compose them using a meta-rule; the proposed approach is called SYNPO (Synthesized Policies).
+
+* [Link to the paper](https://arxiv.org/abs/1904.03276)
+
+## Setup
+
+* Three settings are considered:
+
+    * *S1*: Transfer to a new (environment, task) pair when the agent has been trained on the environment and the task before (but not on the two together).
+
+    * *S2*: Transfer to a new (environment, task) pair where either the environment or the task has not been seen previously.
+
+    * *S3*: Transfer to a new (environment, task) pair where neither the environment nor the task has been seen previously.
+
+* In the second and third settings, the agent is allowed to collect some data in the new environment or task.
+
+* The (environment, task) combinations that the agent has seen during training are referred to as *seen* combinations, while the remaining combinations are referred to as *unseen* combinations.
+
+* Concretely, the approach is to:
+
+    * learn embeddings of environments and tasks, and
+
+    * use these embeddings to compose a policy (parameterized as a linear combination of a policy basis).
+
+* A disentanglement objective is used to decouple the task and environment embeddings.
+
+### Policy Composition
+
+* Given an (environment, task) pair $z = (\epsilon, \tau)$, the policy is given by $\pi_z(a\|s) \propto \exp(\psi_s^T U(e_{\epsilon}, e_{\tau}) \phi_{a} + b_{\pi})$.
+
+* Here $b_{\pi}$ is a scalar bias, $\psi_{s}$ and $\phi_{a}$ are state and action representations, and $U$ is parameterized as a linear combination of $K$ basis matrices $\Theta_k$:
+
+* $U(e_{\epsilon}, e_{\tau}) = \sum_{k=1}^{K}\alpha_k(e_{\epsilon}, e_{\tau})\Theta_k$.
+
+* The basis matrices $\Theta_k$ are shared across all (environment, task) pairs, while the coefficients $\alpha_k$ are specific to each (environment, task) pair.
+
+* During training, the agent also predicts rewards using the same set of basis matrices but different coefficients.
+
+### Disentangling environment and task embeddings
+
+* Given an (environment, task) pair, the agent is trained to decode the environment (and task) from the agent's trajectory.
+
+* The sequence of state-action pairs (in the trajectory) is mapped to a sequence of state-action representations, with one feature per basis matrix, given by $\psi_s^T\Theta_k\phi_{a}$.
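+
+To make the composition rule concrete, below is a minimal PyTorch-style sketch of how a policy could be synthesized from environment and task embeddings. The class name, tensor shapes, and the single linear layer used for the coefficient function $\alpha$ are assumptions made for illustration; they are not taken from the paper or its released code.
+
+```python
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+class SynPoPolicySketch(nn.Module):
+    """Illustrative sketch of SYNPO-style policy composition (not the authors' code).
+
+    Policy logits: psi(s)^T U(e_env, e_task) phi(a) + b, where
+    U(e_env, e_task) = sum_k alpha_k(e_env, e_task) * Theta_k.
+    """
+
+    def __init__(self, state_dim, action_dim, embed_dim, num_basis):
+        super().__init__()
+        # K basis matrices Theta_k, shared across all (environment, task) pairs.
+        self.basis = nn.Parameter(0.01 * torch.randn(num_basis, state_dim, action_dim))
+        # Coefficient function alpha(e_env, e_task); a single linear layer over the
+        # concatenated embeddings is an assumption made here for simplicity.
+        self.coef = nn.Linear(2 * embed_dim, num_basis)
+        # Scalar bias b_pi, kept to mirror the formula (it cancels under the softmax).
+        self.bias = nn.Parameter(torch.zeros(1))
+
+    def forward(self, psi_s, phi_a, e_env, e_task):
+        # psi_s: (B, state_dim) state features; phi_a: (A, action_dim) action features.
+        alpha = self.coef(torch.cat([e_env, e_task], dim=-1))    # (B, K)
+        U = torch.einsum('bk,kda->bda', alpha, self.basis)       # (B, state_dim, action_dim)
+        logits = torch.einsum('bd,bda,na->bn', psi_s, U, phi_a)  # (B, A)
+        return F.softmax(logits + self.bias, dim=-1)
+
+    def state_action_features(self, psi_s, phi_a_taken):
+        # K-dimensional feature per (state, action) pair, with k-th entry
+        # psi(s)^T Theta_k phi(a); a trajectory of these features can be fed to a
+        # decoder that recovers the environment and task identities.
+        return torch.einsum('bd,kda,ba->bk', psi_s, self.basis, phi_a_taken)
+```
+
+With this kind of parameterization, adapting to a new (environment, task) pair amounts to producing (or finetuning) the pair's embeddings and coefficients $\alpha_k$, while the basis matrices and the state/action representations are reused.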
+
+## Experiment Setup
+
+* The agent is trained (and evaluated) mostly in an imitation learning setup, with some reinforcement learning experiments.
+
+### Environments
+
+* GRIDWORLD
+
+    * Twenty $16 \times 16$ grid-aligned mazes that are similar in appearance but differ in topology.
+
+    * The task is to collect colored blocks in a given order. In each task, the starting position of the agent and the positions of the blocks are randomized.
+
+    * Each environment has 20 tasks, leading to a total of 400 (environment, task) combinations.
+
+* [THOR](https://arxiv.org/abs/1712.05474)
+
+    * This is a 3D simulator where the agent is placed in photo-realistic indoor scenes.
+
+    * The tasks involve searching for objects and performing actions on them, e.g., "put cabbage on the fridge."
+
+    * The setup uses 19 scenes (environments), with each environment comprising 21 tasks.
+
+### Baselines
+
+* An MLP that takes the concatenation of the state, environment embedding, and task embedding as input.
+
+* [Successor feature model](https://arxiv.org/abs/1606.05312)
+
+* [Module Network](https://arxiv.org/abs/1609.07088)
+
+* Multi-task learning, where the distinction between the environments is ignored.
+
+## Results
+
+* GRIDWORLD
+
+    * In the first setting (*S1*):
+
+        * SYNPO outperforms all the baselines.
+
+        * As the agent is trained on more (environment, task) combinations, its performance on the unseen combinations improves. This trend saturates when the *seen/total* ratio reaches about 0.4 (i.e., training on 40% of all the combinations).
+
+        * Task disentanglement is more important than environment disentanglement.
+
+    * In the second and third settings (*S2* and *S3*):
+
+        * The agent uses one demonstration from each test pair to finetune the embeddings.
+
+        * *S2* is an easier setting than *S3*.
+
+        * Transfer learning across tasks is easier than transfer learning across environments.
+
+* THOR
+
+    * SYNPO outperforms all the baselines on both seen and unseen combinations.
+
\ No newline at end of file
diff --git a/site/_site b/site/_site
index aaef8796..6c9d21df 160000
--- a/site/_site
+++ b/site/_site
@@ -1 +1 @@
-Subproject commit aaef8796ec89128f003b90832df9940746f2000d
+Subproject commit 6c9d21df445759dc4ee55d2170f9a88dc8e9e331