Developing safe and beneficial AI systems requires making them aware of and aligned with human preferences. Since humans exert significant control over the environments in which reinforcement learning (RL) agents operate, we conjecture that RL agents implicitly learn human preferences. Our research aim is to first show that these preferences are represented within an agent and then to extract them. To start, we tackle this problem in a toy grid-like environment where an RL agent is rewarded for collecting apples. Building on previous work (Wichers 2020), which showed that these implicit preferences exist and can be extracted, our first approach applied a variety of modern interpretability techniques to an RL agent trained in this environment in order to find meaningful portions of its network. We are currently pursuing methods to isolate a subnetwork within the trained RL agent that predicts human preferences.
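As a concrete illustration of the subnetwork-isolation direction, the sketch below trains a score-based binary mask over a frozen layer of a trained agent so that the surviving weights predict human-preference labels, in the spirit of Ramanujan et al. (2020). This is a minimal sketch under stated assumptions, not our implementation: the `MaskedLinear` module, the layer shapes, and the `preference_batches` data loader are hypothetical stand-ins.

```python
# Minimal sketch (PyTorch): isolate a preference-predicting subnetwork by
# training only per-weight scores over a frozen layer of the trained agent.
# The agent layer, shapes, and preference_batches() loader are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinear(nn.Module):
    """Linear layer with frozen weights; only a score per weight is trained.
    At forward time the top-k fraction of highest-scoring weights is kept."""

    def __init__(self, weight, bias, keep_ratio=0.3):
        super().__init__()
        # Frozen weights copied from the trained RL agent.
        self.weight = nn.Parameter(weight.clone(), requires_grad=False)
        self.bias = nn.Parameter(bias.clone(), requires_grad=False)
        # Trainable scores decide which weights belong to the subnetwork.
        self.scores = nn.Parameter(torch.randn_like(weight) * 0.01)
        self.keep_ratio = keep_ratio

    def forward(self, x):
        k = int(self.scores.numel() * self.keep_ratio)
        threshold = self.scores.flatten().topk(k).values.min()
        # Straight-through estimator: hard 0/1 mask in the forward pass,
        # gradients flow to the scores as if the mask were the identity.
        hard_mask = (self.scores >= threshold).float()
        mask = hard_mask + self.scores - self.scores.detach()
        return F.linear(x, self.weight * mask, self.bias)


def train_preference_probe(agent_weight, agent_bias, preference_batches, epochs=5):
    """Train only the mask scores so the isolated subnetwork predicts
    binary human-preference labels from the agent's features."""
    probe = MaskedLinear(agent_weight, agent_bias)
    opt = torch.optim.Adam([probe.scores], lr=1e-3)
    for _ in range(epochs):
        for features, pref_labels in preference_batches():  # hypothetical loader
            logits = probe(features)
            loss = F.binary_cross_entropy_with_logits(
                logits.squeeze(-1), pref_labels.float()
            )
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```

Because the agent's weights stay frozen and only the mask scores are optimized, any predictive power of the resulting subnetwork must come from structure already present in the trained agent, which is the property we want to test.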
Our project report, which includes a detailed overview of our intermediate results, has been published on LessWrong:
Please refer to the notebooks and the readme in /agent for details on the agents.
Nevan Wichers, Riccardo Volpato, Mislav Jurić and Arun Raja
We would like to thank Paul Christiano, Evan Hubinger, Jacob Hilton and Christos Dimitrakakis for their research advice during AI Safety Camp 2020.
Deep Reinforcement Learning from Human Preferences. Christiano et al. (2017)
RL Agents Implicitly Learning Human Preferences. Wichers, N. (2020)
Understanding RL Vision. Hilton et al. (2020)
ViZDoom GitHub code repository. Wydmuch et al. (2018)
What's Hidden in a Randomly Weighted Neural Network? Ramanujan et al. (2020)