Extraction of human preferences 👨→🤖

AI Safety Camp (Toronto 2020) research project: extracting implicitly learned human preferences from RL agents.

Developing safe and beneficial AI systems requires making them aware of, and aligned with, human preferences. Since humans exert significant control over the environments in which reinforcement learning (RL) agents operate, we conjecture that RL agents implicitly learn human preferences. Our research aims first to show that these preferences are represented inside an agent and then to extract them. To start, we tackle this problem in a toy grid-like environment where an RL agent is rewarded for collecting apples. Previous work (Wichers 2020) has shown that these implicit preferences exist and can be extracted, so our first approach applied a variety of modern interpretability techniques to the agent trained in this environment to find meaningful portions of its network. We are currently pursuing methods to isolate a subnetwork within the trained RL agent that predicts human preferences, as sketched below.
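The subnetwork-isolation idea is in the spirit of Ramanujan et al. (2020): keep the trained agent's weights frozen and learn a binary mask over them so that the surviving weights predict a preference label. The snippet below is a minimal, hypothetical PyTorch sketch of that idea, not the project's actual code; the class name `PreferenceMaskedLinear`, the `keep_fraction` value, and the toy data are assumptions made only for illustration.

```python
# Minimal sketch (illustrative, not the project's code): freeze a trained layer's
# weights and learn a mask over them so the masked subnetwork predicts a preference label.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreferenceMaskedLinear(nn.Module):
    """Frozen weights + learnable scores; the top-k scored weights form the subnetwork."""
    def __init__(self, weight: torch.Tensor, keep_fraction: float = 0.3):
        super().__init__()
        self.weight = nn.Parameter(weight.detach().clone(), requires_grad=False)  # frozen agent weights
        self.scores = nn.Parameter(torch.randn_like(weight) * 0.01)               # learnable mask scores
        self.keep_fraction = keep_fraction

    def forward(self, x):
        k = int(self.scores.numel() * self.keep_fraction)
        threshold = self.scores.flatten().kthvalue(self.scores.numel() - k + 1).values
        hard_mask = (self.scores >= threshold).float()
        # Straight-through estimator: forward uses the hard 0/1 mask, backward flows to the scores.
        mask = hard_mask + self.scores - self.scores.detach()
        return F.linear(x, self.weight * mask)

# Toy usage: pretend `agent_hidden` is one hidden layer taken from the trained RL agent.
torch.manual_seed(0)
agent_hidden = torch.randn(16, 32)                      # (out_features, in_features)
probe = nn.Sequential(PreferenceMaskedLinear(agent_hidden), nn.ReLU(), nn.Linear(16, 1))

features = torch.randn(64, 32)                          # stand-in for the agent's activations
labels = torch.randint(0, 2, (64, 1)).float()           # hypothetical preference labels

opt = torch.optim.Adam([p for p in probe.parameters() if p.requires_grad], lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    loss = F.binary_cross_entropy_with_logits(probe(features), labels)
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.3f}")
```

In the actual setting, the features would be activations gathered from rollouts of the trained apple-collecting agent and the labels would come from human preference annotations; the straight-through estimator is just one common way to optimise a hard mask and stands in for whatever masking method the project ultimately uses.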

Report and results 📝 📉

Our project report, which includes a detailed overview of our intermediate results, has been published on LessWrong.

Running experiments 🧪

Please refer to the notebooks and the README in /agent for details on the agents.

Team 🧑‍🤝‍🧑

Nevan Wichers, Riccardo Volpato, Mislav Jurić and Arun Raja

Acknowledgements 🙏

We would like to thank Paul Christiano, Evan Hubinger, Jacob Hilton and Christos Dimitrakakis for their research advice during AI Safety Camp 2020.

References 📚

Deep Reinforcement Learning from Human Preferences. Christiano et al. (2017)

RL Agents Implicitly Learning Human Preferences. Wichers, N. (2020)

Understanding RL Vision. Hilton et al. (2020)

ViZDoom GitHub code repository. Wydmuch et al. (2018)

What's Hidden in a Randomly Weighted Neural Network? Ramanujan et al. (2020)
