Add various social bias tasks #1185

oskarvanderwal · 2023-12-21T10:44:00Z

This PR implements various popular benchmarks for evaluating LMs for social biases. I also aim to have these validated where possible: e.g., by comparing with existing implementations or results, or at least have a second person indicate it makes sense. @StellaAthena suggested to get some help in validating the implementations.

Benchmark	Implemented	Used by e.g.
Simple Cooccurrence Bias	This PR	GPT-3, PALM, Gopher
Winogender	This PR	GPT-3, LLaMA, PALM, Chinchilla, Gemma
BBQ	This PR	BIG-bench, HELM, PaLM 2, Gemma
~~StereoSet~~	skipping	Jurassic-1, OPT, Galactica
CrowS-Pairs	already implemented	LLaMA-65B, BLOOM, OPT, Pythia, Galactica, StarCoder, Gemma
RealToxicityPrompts	already implemented	LLaMA, PALM, Gopher, OPT, PaLM 2, Gemma

Winogender

This implementation follows the description from the paper "Language Models are Few-Shot Learners".
However, for validation, I compare my results with the results reported in the LLaMA paper, who should have the same implementation.

Unfortunately, their results (reported in Table 13 of the LLaMA paper) are not reproduced by my implementation:

Subset	7B
All	66.0
her/her/she	65.0
his/him/he	60.8
their/them/someone	72.1
her/her/she (gotcha)	64.2
his/him/he (gotcha)	55.0

For the 7B LLaMA model (using the llama-cpp-python server with mps), I got the following for different prompts:

Prompt 1: "{{sentence}} ‘{{pronoun.capitalize()}}’ refers to the"

Tasks	Version	Filter	Metric	Value		Stderr
winogender	Yaml	none	acc	0.6208	±	0.0181
winogender_female	Yaml	none	acc	0.6208	±	0.0314
winogender_male	Yaml	none	acc	0.6042	±	0.0316
winogender_neutral	Yaml	none	acc	0.6375	±	0.0311
winogender_gotcha_female	Yaml	none	acc	0.6000	±	0.0449
winogender_gotcha_male	Yaml	none	acc	0.5417	±	0.0457

Prompt 2: "{{sentence}}\n\n‘{{pronoun.capitalize()}}’ refers to the"

Tasks	Version	Filter	Metric	Value		Stderr
- winogender_all	1	none	acc	0.5861	±	0.0184
- winogender_female	1	none	acc	0.5792	±	0.0319
- winogender_male	1	none	acc	0.5792	±	0.0319
- winogender_neutral	1	none	acc	0.6000	±	0.0317

Prompt 1 + description: "Answer these questions:\n\n"

Tasks	Version	Filter	Metric	Value		Stderr
- winogender_all	1	none	acc	0.5958	±	0.0183
- winogender_female	1	none	acc	0.6000	±	0.0317
- winogender_male	1	none	acc	0.5708	±	0.0320
- winogender_neutral	1	none	acc	0.6167	±	0.0314

BBQ

Compare with HELM implementation

StereoSet

~~[ ] Finish implementation~~
~~[ ] Compare GPT-2 results with StereoSet leaderboard~~

…rness into winogender

CLAassistant · 2023-12-21T10:44:07Z

All committers have signed the CLA.

…rness into winogender

oskarvanderwal · 2024-01-10T14:55:09Z

BBQ required me to implement custom metrics. Interestingly, everything works when running each subset of BBQ individually, but I run into a problem when running instead the bbq group:
TypeError: unsupported operand type(s) for ** or pow(): 'str' and 'int'
This seems to happen when computing the var_score in evaluator.py, where stderr is a string.

This is a minor issue for me, but worth diving into?

oskarvanderwal · 2024-01-10T15:01:14Z

For now, I decided against including StereoSet.

StellaAthena · 2024-01-10T16:06:45Z

Does the BBQ implementation replicate any papers @oskarvanderwal?

And if the Winogender implementation matched the Winogrande one (from this repo which should be the same as the GPT-3 paper) according to --log_samples we're fine.

oskarvanderwal · 2024-01-10T16:26:04Z

@StellaAthena

Winogender does currently not replicate the results reported in the LLaMA v1 paper. @lintangsutawika suggested trying different prompts to see if one variation does agree. I have no way to verify with the "original" implementation from GPT-3. EDIT: I'll check with the Winogrande implementation!
For BBQ, I haven't run the full suite yet, and hoped @lintangsutawika could help me out here. HELM reports BBQ results for a bunch of larger models here; I think GPT-J (6B) would be a good candidate. What makes it difficult to completely validate it with HELM, is that I couldn't find which 5 examples they used for their few-shot evaluation. I did take the HELM implementation as basis for my implementation, but wasn't able to do so 1-to-1. If I understand HELM correctly, it does not evaluate multiple choice in the same way as eval-harness. The dataset (my copy here) required quite some post-processing as well, which is another potential source of deviations.

StellaAthena · 2024-01-11T15:28:34Z

@oskarvanderwal I don't see anything in the BigScience fork of this repo or in the BLOOM paper about BBQ.

The BBQ GitHub repo says the following:

For testing Unified QA, we used an off-the-shelf model. String formatting for inference was created by concatenating the following fields from the data files:

RACE-style-format: question + \n + '(a)' + ans_0 + '(b)' + ans_1 + '(c)' + ans2 + \n + context

ARC-style-format: context + question + \n + '(a)' + ans_0 + '(b)' + ans_1 + '(c)' + ans2

The BigBench version appears to be: 'Q: ' + context + question + '\n choice: ' + ans_0 + '\n choice: ' + ans_1 + '\n choice: ' + ans_3 + '\nA: '

I haven't figured out how to read HELM prompts yet... should get to learning that :/

@Alicia-Parrish is there a way that you would like us to format the BBQ evaluation for generative models?

Alicia-Parrish · 2024-01-12T17:34:02Z

For generative models, you can follow the method that we used for evaluation in the PaLM2 technical report -- see Appendix D.6. The input format was just "{{context}}.\n\nQ: {{question}}\nA:" (so, none of the multiple choice answer options are shown). Then the response can be evaluated by just using string search for each of the two individuals introduced in the context (this is easier when you use the metadata fields than the answer options). This worked pretty well and only required manual coding of 14% of examples.

If you prefer evaluating still using the multiple choice format, I would just recommend using a few different formats that the model has seen before, to make sure that model behavior isn't super different across format options.

oskarvanderwal · 2024-01-12T19:22:55Z

@StellaAthena re: mentioning BLOOM using BBQ, you're right in that BLOOM didn't explicitly evaluate on BBQ. They did however evaluate on HELM and report their results on the bias category in Figure 10. I wrongly assumed this targeted the BBQ dataset, but looking at the HELM paper more carefully I don't think this is the case. I'll remove the mention from the column.

About implementing BBQ: do you think we should follow the methodology described in the PaLM2 technical report (which I think is similar to HELM)? Or do you prefer the BIG-bench approach or something more closely to how eval-harness currently evaluates multiple choice?

StellaAthena · 2024-01-12T20:13:50Z

@Alicia-Parrish Thanks!

@oskarvanderwal Let's support the standard Q-A formatting that we typically use and also the palm2 formatting.

I specifically dislike the BigBench approach and think we should never use it except for the sake of replication. We can create a metaprompt template for it.

…rness into winogender

…t_name from the subsets.

…rness into winogender

…ed for few shot eval)

…rness into winogender

justinphan3110cais · 2024-04-24T22:07:40Z

@oskarvanderwal , Hi Oskar, quick question, why did you decide against stereoset?

…rness into winogender

oskarvanderwal · 2024-05-10T09:54:29Z

@justinphan3110cais, mainly because of time constraints on my end. And since I am mainly interested in evaluating autoregressive models, the implementation of StereoSet would also become more complicated. (It was mainly designed for masked language models, although the authors do suggest an adaptation in their paper). And last of all, StereoSet and CrowS-Pairs are already fairly similar in their approach to measuring bias.

Just to clarify: my main focus is on testing the validity/reliability of said bias benchmarks, more so than measuring bias in LMs.

…rness into winogender

notrichardren · 2024-08-05T21:33:07Z

Hi! How is progress on adding these evaluations to lm-eval-harness?

oskarvanderwal added 11 commits November 29, 2023 14:07

Implementation of Winogender

d65e7da

Minor fixes README.md

117c11d

Merge with upstream big-refactor branch

c54b5d1

Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-ha…

05480a1

…rness into winogender

Add winogender

14fca4c

Clean winogender utils.py

ae2cc47

Change dataset to one containing All subsets

262d1a9

Flesh out README for BBQ task

f84cb2d

Add missing tasks for BBQ

16acefb

Add simple cooccurrence bias task

55124a2

Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-ha…

c11fa4c

…rness into winogender

oskarvanderwal added this to the Next Minor Version Release milestone Dec 22, 2023

BBQ: implement original metrics+deal with edge cases

a2d17b3

oskarvanderwal removed this from the Next Minor Version Release milestone Jan 4, 2024

Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-ha…

6010d8f

…rness into winogender

oskarvanderwal marked this pull request as ready for review January 10, 2024 14:58

oskarvanderwal requested review from haileyschoelkopf and lintangsutawika as code owners January 10, 2024 14:58

oskarvanderwal added 3 commits January 22, 2024 14:15

Fix wrong mask for ambiguated context+rename metrics

2ca5130

Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-ha…

670cfa8

…rness into winogender

Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-ha…

8f6f038

…rness into winogender

oskarvanderwal added 22 commits February 13, 2024 15:40

Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-ha…

f4b99b0

…rness into winogender

Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-ha…

79007e0

…rness into winogender

Fixes for bbq (multiple choice)

79f1c76

Fix wrong dataset

79e49eb

Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-ha…

8f8a094

…rness into winogender

CrowS-Pairs: make it easier to use another dataset by removing datase…

85cd271

…t_name from the subsets.

Use simplest prompt possible without description

f6d4273

Merge

b3edb0c

Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-ha…

62e133e

…rness into winogender

BBQ: Fix np.NaN related bug

55c6ee3

Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-ha…

95d547e

…rness into winogender

BBQ: Fix wrong aggregation method for disamb accuracy

a2a32c0

BBQ: Make it possible to only evaluate on (dis)ambiguous subset (need…

d90ba33

…ed for few shot eval)

BBQ: fix showing one target in case of few-shot evals

45edd88

Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-ha…

89fc160

…rness into winogender

BBQ: Fix few-shot example for bbq_generate

b1ce205

BBQ: simplify subtasks

775f773

BBQ: Minimize number of UNK variations to reduce inference time

797ba4c

Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-ha…

0e42bd1

…rness into winogender

Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-ha…

604ed66

…rness into winogender

BBQ: Add extra UNK keywords for the generate task

ad81a21

Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-ha…

59ff7b5

…rness into winogender

justinphan3110cais mentioned this pull request Apr 24, 2024

[In-progress] Add Winogender, simple_cooccurrence_bias, and scruples steven-safeai/lm-evaluation-harness#1

Merged

Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-ha…

b25a998

…rness into winogender

oskarvanderwal added 3 commits May 22, 2024 16:34

Add a generate_until version of simple_cooccurrence_bias

5b7dedf

Change system/description prompt to include few-shot examples

9c42f07

Merge branch 'main' of https://github.com/EleutherAI/lm-evaluation-ha…

a67e129

…rness into winogender

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add various social bias tasks #1185

Add various social bias tasks #1185

oskarvanderwal commented Dec 21, 2023 •

edited

Loading

CLAassistant commented Dec 21, 2023 •

edited

Loading

oskarvanderwal commented Jan 10, 2024 •

edited

Loading

oskarvanderwal commented Jan 10, 2024

StellaAthena commented Jan 10, 2024 •

edited

Loading

oskarvanderwal commented Jan 10, 2024 •

edited

Loading

StellaAthena commented Jan 11, 2024

Alicia-Parrish commented Jan 12, 2024

oskarvanderwal commented Jan 12, 2024 •

edited

Loading

StellaAthena commented Jan 12, 2024

justinphan3110cais commented Apr 24, 2024

oskarvanderwal commented May 10, 2024 •

edited

Loading

notrichardren commented Aug 5, 2024

Add various social bias tasks #1185

Are you sure you want to change the base?

Add various social bias tasks #1185

Conversation

oskarvanderwal commented Dec 21, 2023 • edited Loading

Winogender

BBQ

StereoSet

CLAassistant commented Dec 21, 2023 • edited Loading

oskarvanderwal commented Jan 10, 2024 • edited Loading

oskarvanderwal commented Jan 10, 2024

StellaAthena commented Jan 10, 2024 • edited Loading

oskarvanderwal commented Jan 10, 2024 • edited Loading

StellaAthena commented Jan 11, 2024

Alicia-Parrish commented Jan 12, 2024

oskarvanderwal commented Jan 12, 2024 • edited Loading

StellaAthena commented Jan 12, 2024

justinphan3110cais commented Apr 24, 2024

oskarvanderwal commented May 10, 2024 • edited Loading

notrichardren commented Aug 5, 2024

oskarvanderwal commented Dec 21, 2023 •

edited

Loading

CLAassistant commented Dec 21, 2023 •

edited

Loading

oskarvanderwal commented Jan 10, 2024 •

edited

Loading

StellaAthena commented Jan 10, 2024 •

edited

Loading

oskarvanderwal commented Jan 10, 2024 •

edited

Loading

oskarvanderwal commented Jan 12, 2024 •

edited

Loading

oskarvanderwal commented May 10, 2024 •

edited

Loading