# Dataset Card for _folktexts_ <!-- omit in toc -->

[Folktexts](https://github.com/socialfoundations/folktexts) is a suite of Q&A
datasets with natural outcome uncertainty, aimed at evaluating LLMs' calibration
on unrealizable tasks.

The *folktexts* datasets are derived from US Census data products, namely the
[2018 Public Use Microdata Sample](https://www.census.gov/programs-surveys/acs/microdata/documentation/2018.html)
(PUMS). Individual features are mapped to natural text using the respective
[codebook](https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2018.pdf).
Each task relates to predicting different individual
characteristics (e.g., income, employment) from a set of demographic features
(e.g., age, race, education, occupation).

Importantly, every task has natural outcome uncertainty: in general, the
features describing a row do not uniquely determine the task's label. To
perform well on these tasks, a model must output calibrated, nuanced scores
between 0 and 1, instead of simply outputting the discrete labels 0 or 1.

We make available the following tasks in natural language Q&A format:
- `ACSIncome`: Predict whether a working adult earns above $50,000 yearly.
- `ACSEmployment`: Predict whether an adult is an employed civilian.
- `ACSPublicCoverage`: Predict individual public health insurance coverage.
- `ACSMobility`: Predict whether an individual changed address within the last year.
- `ACSTravelTime`: Predict whether an employed adult has a work commute time longer than 20 minutes.


These tasks follow the same naming and feature/target columns as the
[folktables](https://github.com/socialfoundations/folktables)
tabular datasets proposed by
[Ding et al. (2021)](https://proceedings.neurips.cc/paper_files/paper/2021/file/32e54441e6382a7fbacbbbaf3c450059-Paper.pdf).
The folktables tabular datasets are widely used in the algorithmic fairness
and distribution shift communities; the datasets here provide natural language
Q&A versions of those tasks.

The datasets are made available in standard multiple-choice Q&A format (columns
`question`, `choices`, `answer`, `answer_key`, and `choice_question_prompt`), as
well as in numeric Q&A format (columns `numeric_question`,
`numeric_question_prompt`, and `label`).
Numeric prompting (also known as *verbalized prompting*) is known to improve
the calibration of zero-shot LLM risk scores
[[Tian et al., EMNLP 2023](https://openreview.net/forum?id=g3faCfrwm7);
[Cruz et al., NeurIPS 2024](https://arxiv.org/pdf/2407.14614)].
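
A minimal sketch of loading one task and inspecting both formats, assuming the
standard Hugging Face `datasets` API (the dataset ID below is a placeholder,
and treating each task name as a configuration name is an assumption):

```python
# Load one task and inspect both Q&A formats.
# Assumptions: DATASET_ID is a placeholder for this repository's Hugging Face
# ID, and each task (e.g., ACSIncome) is exposed as a dataset configuration.
from datasets import load_dataset

DATASET_ID = "<namespace>/folktexts"  # placeholder -- replace with the real ID

ds = load_dataset(DATASET_ID, name="ACSIncome", split="test")

row = ds[0]
print(row["choice_question_prompt"])   # full multiple-choice Q&A prompt
print(row["answer_key"])               # correct key, e.g. "A" or "B"
print(row["numeric_question_prompt"])  # numeric (verbalized) Q&A prompt
print(row["label"])                    # binary ground-truth label
```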

**The accompanying [`folktexts` Python package](https://github.com/socialfoundations/folktexts)
eases customization, evaluation, and benchmarking with these datasets.**

Table of contents:
- [Dataset Details](#dataset-details)
- [Uses](#uses)
- [Dataset Structure](#dataset-structure)
- [Dataset Creation](#dataset-creation)
- [Citation](#citation)
- [More Information](#more-information)


## Dataset Details

### Dataset Description <!-- omit in toc -->

- **Language(s) (NLP):** English
- **License:** Code is licensed under the [MIT license](https://github.com/socialfoundations/folktexts/blob/main/LICENSE); data use is governed by the U.S. Census Bureau [terms of service](https://www.census.gov/data/developers/about/terms-of-service.html).

### Dataset Sources <!-- omit in toc -->

- **Repository:** https://github.com/socialfoundations/folktexts
- **Paper:** https://arxiv.org/pdf/2407.14614
- **Data source:** [2018 American Community Survey Public Use Microdata Sample](https://www.census.gov/programs-surveys/acs/microdata/documentation/2018.html)

## Uses

The datasets were originally used in [Cruz et al. (2024)](https://arxiv.org/pdf/2407.14614)
to evaluate LLMs' ability to produce calibrated and accurate risk scores.

Other potential uses include evaluating the fairness of LLMs' decisions,
as individual rows feature protected demographic attributes such as `sex` and
`race`.


## Dataset Structure

**Description of dataset columns:**
- `id`: A unique row identifier.
- `description`: A textual description of an individual's features, following a bulleted-list format.
- `instruction`: The instruction used for zero-shot LLM prompting (should be prepended to the row description, as sketched after this list).
- `question`: A question relating to the task's target column.
- `choices`: A list of two answer options relating to the above question.
- `answer`: The correct answer from the above list of answer options.
- `answer_key`: The correct answer key; i.e., `A` for the first choice, or `B` for the second choice.
- `choice_question_prompt`: The full multiple-choice Q&A text string used for LLM prompting.
- `numeric_question`: A version of the question that prompts for a *numeric output* instead of a *discrete choice output*.
- `label`: The task's label. This is the correct output to the above numeric question.
- `numeric_question_prompt`: The full numeric Q&A text string used for LLM prompting.
- `<tabular-columns>`: All other columns correspond to the task's tabular features. Each of these features also appears in text form in the `description` column above.
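
A minimal sketch of assembling a full zero-shot prompt from these columns
(the exact separators are an assumption; the `folktexts` package performs this
assembly for you):

```python
# Assemble a zero-shot prompt: the instruction is prepended to the row
# description, followed by one of the pre-built Q&A strings.
# The "\n\n" separators are an assumption.
def build_prompt(row: dict, numeric: bool = False) -> str:
    qa = row["numeric_question_prompt"] if numeric else row["choice_question_prompt"]
    return f"{row['instruction']}\n\n{row['description']}\n\n{qa}"
```

For multiple-choice prompting, the model's answer is then compared against
`answer_key`; for numeric prompting, its output score is compared against `label`.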

The dataset was randomly split into `training`, `test`, and `validation` sets,
following an 80%/10%/10% ratio.
Only the `test` split should be used to evaluate zero-shot LLM performance.
The `training` split can be used for fine-tuning, or for fitting traditional
supervised ML models on the tabular columns to obtain baseline metrics.
The `validation` split should be used for hyperparameter tuning, feature
engineering, or any other model-improvement loop.
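
For illustration, such a tabular baseline might look as follows (a sketch
only: the dataset ID is a placeholder, the feature names assume ACSIncome's
folktables columns, and raw categorical codes are fed to the model unencoded):

```python
# Fit a traditional supervised baseline on the tabular columns of the
# training split, then score the test split for comparison with LLM risk
# scores. DATASET_ID is a placeholder; FEATURES assumes ACSIncome's columns.
from datasets import load_dataset
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

DATASET_ID = "<namespace>/folktexts"  # placeholder -- replace with the real ID
FEATURES = ["AGEP", "COW", "SCHL", "MAR", "OCCP", "POBP", "RELP", "WKHP", "SEX", "RAC1P"]

train_df = load_dataset(DATASET_ID, name="ACSIncome", split="training").to_pandas()
test_df = load_dataset(DATASET_ID, name="ACSIncome", split="test").to_pandas()

model = LogisticRegression(max_iter=1000)
model.fit(train_df[FEATURES], train_df["label"])

# Predicted risk scores on the held-out test split.
scores = model.predict_proba(test_df[FEATURES])[:, 1]
print("Test AUC:", roc_auc_score(test_df["label"], scores))
```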

## Dataset Creation

### Source Data <!-- omit in toc -->

The datasets are based on publicly available data from the American Community
Survey (ACS) Public Use Microdata Sample (PUMS), namely the
[2018 ACS 1-year PUMS files](https://www.census.gov/programs-surveys/acs/microdata/documentation.2018.html#list-tab-1370939201).

#### Data Collection and Processing <!-- omit in toc -->

The categorical values were mapped to meaningful natural language
representations using the `folktexts` package, which in turn uses the official
[ACS PUMS codebook](https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2018.pdf).
Data download and processing were aided by the `folktables` Python package,
which in turn uses the official US Census web API.
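
As a rough sketch, the underlying tabular data can be fetched along these
lines, mirroring folktables' documented usage (the natural language rendering
itself is done separately by `folktexts`):

```python
# Fetch the 2018 1-year ACS PUMS person records for one state, then apply the
# ACSIncome task's filters and select its feature/target columns.
from folktables import ACSDataSource, ACSIncome

data_source = ACSDataSource(survey_year="2018", horizon="1-Year", survey="person")
acs_data = data_source.get_data(states=["CA"], download=True)

features, labels, _group = ACSIncome.df_to_pandas(acs_data)
```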

#### Who are the source data producers? <!-- omit in toc -->

U.S. Census Bureau.

## Citation

If you find this useful in your research, please consider citing the following paper:

```bib
@inproceedings{cruz2024evaluating,
  title={Evaluating language models as risk scores},
  author={Andr{\'e} F Cruz and Moritz Hardt and Celestine Mendler-D{\"u}nner},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=qrZxL3Bto9}
}
```

## More Information

More information is available in the [`folktexts`](https://github.com/socialfoundations/folktexts) package repository
and the [accompanying paper](https://arxiv.org/pdf/2407.14614).

### Dataset Card Authors <!-- omit in toc -->

[André F. Cruz](https://github.com/andrefcruz)
