-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add various social bias tasks #1185
base: main
Are you sure you want to change the base?
Conversation
BBQ required me to implement custom metrics. Interestingly, everything works when running each subset of BBQ individually, but I run into a problem when running instead the This is a minor issue for me, but worth diving into? |
For now, I decided against including StereoSet. |
Does the BBQ implementation replicate any papers @oskarvanderwal? And if the Winogender implementation matched the Winogrande one (from this repo which should be the same as the GPT-3 paper) according to |
|
@oskarvanderwal I don't see anything in the BigScience fork of this repo or in the BLOOM paper about BBQ. The BBQ GitHub repo says the following:
The BigBench version appears to be: I haven't figured out how to read HELM prompts yet... should get to learning that :/ @Alicia-Parrish is there a way that you would like us to format the BBQ evaluation for generative models? |
For generative models, you can follow the method that we used for evaluation in the PaLM2 technical report -- see Appendix D.6. The input format was just "{{context}}.\n\nQ: {{question}}\nA:" (so, none of the multiple choice answer options are shown). Then the response can be evaluated by just using string search for each of the two individuals introduced in the context (this is easier when you use the metadata fields than the answer options). This worked pretty well and only required manual coding of 14% of examples. If you prefer evaluating still using the multiple choice format, I would just recommend using a few different formats that the model has seen before, to make sure that model behavior isn't super different across format options. |
@StellaAthena re: mentioning BLOOM using BBQ, you're right in that BLOOM didn't explicitly evaluate on BBQ. They did however evaluate on HELM and report their results on the bias category in Figure 10. I wrongly assumed this targeted the BBQ dataset, but looking at the HELM paper more carefully I don't think this is the case. I'll remove the mention from the column. About implementing BBQ: do you think we should follow the methodology described in the PaLM2 technical report (which I think is similar to HELM)? Or do you prefer the BIG-bench approach or something more closely to how eval-harness currently evaluates multiple choice? |
@Alicia-Parrish Thanks! @oskarvanderwal Let's support the standard Q-A formatting that we typically use and also the palm2 formatting. I specifically dislike the BigBench approach and think we should never use it except for the sake of replication. We can create a metaprompt template for it. |
…t_name from the subsets.
…ed for few shot eval)
@oskarvanderwal , Hi Oskar, quick question, why did you decide against stereoset? |
@justinphan3110cais, mainly because of time constraints on my end. And since I am mainly interested in evaluating autoregressive models, the implementation of StereoSet would also become more complicated. (It was mainly designed for masked language models, although the authors do suggest an adaptation in their paper). And last of all, StereoSet and CrowS-Pairs are already fairly similar in their approach to measuring bias. Just to clarify: my main focus is on testing the validity/reliability of said bias benchmarks, more so than measuring bias in LMs. |
Hi! How is progress on adding these evaluations to lm-eval-harness? |
This PR implements various popular benchmarks for evaluating LMs for social biases. I also aim to have these validated where possible: e.g., by comparing with existing implementations or results, or at least have a second person indicate it makes sense. @StellaAthena suggested to get some help in validating the implementations.
StereoSetWinogender
This implementation follows the description from the paper "Language Models are Few-Shot Learners".
However, for validation, I compare my results with the results reported in the LLaMA paper, who should have the same implementation.
Unfortunately, their results (reported in Table 13 of the LLaMA paper) are not reproduced by my implementation:
For the 7B LLaMA model (using the
llama-cpp-python
server with mps), I got the following for different prompts:Prompt 1: "{{sentence}} ‘{{pronoun.capitalize()}}’ refers to the"
Prompt 2: "{{sentence}}\n\n‘{{pronoun.capitalize()}}’ refers to the"
Prompt 1 + description: "Answer these questions:\n\n"
BBQ
StereoSet
[ ] Finish implementation[ ] Compare GPT-2 results with StereoSet leaderboard