This repository contains the code and data associated with *Word embeddings quantify 100 years of gender and ethnic stereotypes*. PDF available here.
If you use the content in this repository, please cite:
Garg, N., Schiebinger, L., Jurafsky, D. & Zou, J. Word embeddings quantify 100 years of gender and ethnic stereotypes. PNAS 201720347 (2018). doi:10.1073/pnas.1720347115
To re-run all analyses and plots:
- download the vectors from their online sources (links in the paper and below) and normalize each by its L2 norm; a sketch of this step appears after this list
- set the run parameters as in run_params.csv
- run changes_over_time.py
- run create_final_plots_all.py
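For the normalization step, here is a minimal sketch, assuming the vectors are stored in a word2vec-style text format (one word followed by its components per line). The file names and function name are illustrative; normalize_vectors.py in this repository contains the project's own utilities.

```python
# Hypothetical sketch of the L2-normalization step; normalize_vectors.py
# in this repository is the authoritative implementation.
import numpy as np

def l2_normalize_file(in_path, out_path):
    """Rescale every vector in a word2vec-style text file to unit L2 norm."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], np.asarray(parts[1:], dtype=float)
            norm = np.linalg.norm(vec)
            if norm > 0:  # leave the (rare) all-zero vector unchanged
                vec = vec / norm
            fout.write(word + " " + " ".join("%.6f" % x for x in vec) + "\n")

# Illustrative file names:
l2_normalize_file("vectors1910.txt", "vectors1910-normalized.txt")
```

After this step, the cosine similarity between two words reduces to a simple dot product of their vectors.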
dataset_utilities/ contains various helper scripts to preprocess files and create word vectors. Given a corpus such as LDC95T21-North-American-News, which contains many text files (one article each) from a given year, first run create_yrly_datasets.py to produce a single text file per year containing only valid words. Then run pipeline.py on each of these files to create vectors, optionally combining multiple years into a single training set. normalize_vectors.py contains utilities to standardize the vectors. A hypothetical driver for this workflow is sketched below.
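As a concrete illustration of that workflow, the driver below chains the two scripts. The script names come from dataset_utilities/, but every command-line argument shown here is an assumption made for illustration and may not match the scripts' real interfaces.

```python
# Illustrative driver for the preprocessing workflow above. All command-line
# arguments are hypothetical placeholders; check each script for its actual
# interface before running.
import subprocess

# Step 1: collapse the per-article corpus into one cleaned text file per year.
subprocess.run(["python", "create_yrly_datasets.py",
                "LDC95T21-North-American-News", "yearly/"], check=True)

# Step 2: train vectors on each window, here combining three years per window.
for start in range(1987, 1996, 3):
    year_files = ["yearly/%d.txt" % y for y in range(start, start + 3)]
    subprocess.run(["python", "pipeline.py"] + year_files, check=True)
```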
We have uploaded the New York Times embeddings generated for this paper. They are available at http://stanford.edu/~nkgarg/NYTembeddings/. The files are named "vectorsnytXXXX-YYYY.txt", where XXXX is the start year of the training data and YYYY is the end year, NOT inclusive (so vectorsnyt1987-1990.txt has embeddings trained on data from 1987, 1988, and 1989). The text data is from the New York Times Annotated Corpus; if you use these vectors, please also cite that original source.
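A minimal loading sketch, assuming the files use the plain-text "word v1 v2 ..." layout; the helper name is ours, not the repository's.

```python
# Minimal sketch for reading one of the NYT embedding files, assuming a
# plain-text "word v1 v2 ..." layout; load_vectors is an illustrative helper.
import numpy as np

def load_vectors(path):
    """Return a {word: vector} dict from one embedding text file."""
    vectors = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=float)
    return vectors

# vectorsnyt1987-1990.txt covers 1987-1989: the end year is not inclusive.
vecs = load_vectors("vectorsnyt1987-1990.txt")
```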
We use the following publicly available embeddings. If you use them, please cite the associated papers.
Note: the paper mistakenly states that the Genre-Balanced American English embeddings contain data from both Google Books and the Corpus of Historical American English (COHA). They contain only COHA data, though the same website also provides embeddings trained on Google Books.