Developing safe and beneficial AI systems requires making them aware of and aligned with human preferences. Since humans exert significant control over the environments in which reinforcement learning (RL) agents operate, we conjecture that RL agents implicitly learn human preferences. Our research aim is to first show that these preferences are represented within an agent and then to extract them. To start, we tackle this problem in a toy grid-like environment where an RL agent is rewarded for collecting apples. Building on previous work (Wichers 2020), which showed that these implicit preferences exist and can be extracted, our first approach applied a variety of modern interpretability techniques to an RL agent trained in this environment in order to find meaningful portions of its network. We are currently pursuing methods to isolate a subnetwork within the trained RL agent that predicts human preferences.
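As a concrete illustration of the subnetwork-isolation direction, the sketch below trains a score-based binary mask over a frozen layer of a trained agent so that the surviving weights predict human-preference labels, in the spirit of Ramanujan et al. (2020). This is a minimal sketch under stated assumptions, not our implementation: the `MaskedLinear` module, the layer shapes, and the `preference_batches` data loader are hypothetical stand-ins.

```python
# Minimal sketch (PyTorch): isolate a preference-predicting subnetwork by
# training only per-weight scores over a frozen layer of the trained agent.
# The agent layer, shapes, and preference_batches() loader are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinear(nn.Module):
    """Linear layer with frozen weights; only a score per weight is trained.
    At forward time the top-k fraction of highest-scoring weights is kept."""

    def __init__(self, weight, bias, keep_ratio=0.3):
        super().__init__()
        # Frozen weights copied from the trained RL agent.
        self.weight = nn.Parameter(weight.clone(), requires_grad=False)
        self.bias = nn.Parameter(bias.clone(), requires_grad=False)
        # Trainable scores decide which weights belong to the subnetwork.
        self.scores = nn.Parameter(torch.randn_like(weight) * 0.01)
        self.keep_ratio = keep_ratio

    def forward(self, x):
        k = int(self.scores.numel() * self.keep_ratio)
        threshold = self.scores.flatten().topk(k).values.min()
        # Straight-through estimator: hard 0/1 mask in the forward pass,
        # gradients flow to the scores as if the mask were the identity.
        hard_mask = (self.scores >= threshold).float()
        mask = hard_mask + self.scores - self.scores.detach()
        return F.linear(x, self.weight * mask, self.bias)


def train_preference_probe(agent_weight, agent_bias, preference_batches, epochs=5):
    """Train only the mask scores so the isolated subnetwork predicts
    binary human-preference labels from the agent's features."""
    probe = MaskedLinear(agent_weight, agent_bias)
    opt = torch.optim.Adam([probe.scores], lr=1e-3)
    for _ in range(epochs):
        for features, pref_labels in preference_batches():  # hypothetical loader
            logits = probe(features)
            loss = F.binary_cross_entropy_with_logits(
                logits.squeeze(-1), pref_labels.float()
            )
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```

Because the agent's weights stay frozen and only the mask scores are optimized, any predictive power of the resulting subnetwork must come from structure already present in the trained agent, which is the property we want to test.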
Our project report, which includes a detailed overview of our intermediate results, has been published on LessWrong:
Please refer to the notebooks and the readme in /agent for details on the agents.
Nevan Wichers, Riccardo Volpato, Mislav Jurić and Arun Raja
We would like to thank Paul Christiano, Evan Hubinger, Jacob Hilton and Christos Dimitrakakis for their research advice during AI Safety Camp 2020.
Deep Reinforcement Learning from Human Preferences. Christiano et al. (2017)
RL Agents Implicitly Learning Human Preferences. Wichers, N. (2020)
Understanding RL Vision. Hilton et al. (2020)
ViZDoom GitHub code repository. Wydmuch et al. (2018)
What's Hidden in a Randomly Weighted Neural Network? Ramanujan et al. (2020)