Physics GRE task added #1655
base: main
Conversation
Thanks very much for the PR! This is a great contribution.

- Have you tested any models, such as Mistral, on this task? If so, do you have any sample outputs I could review?
- It might make sense to add `..._maj1` variants of this task, or to have maj8 be an available variant and default to not running maj8, due to computational costs.
- Other than that: are all of these filters used by Inflection in their evaluation of this task?

```python
# Remove the data points that have images in the input. 100 -> 76
dataset = dataset.filter(lambda x: x["has_image"] is False)
# Remove the data points without ground truth label. 76 -> 75
dataset = dataset.filter(lambda x: x["target_scores"] is not None)
# All questions must have one and only one correct answer.
assert (
    len(dataset.filter(lambda x: sum(x["target_scores"].values()) != 1)) == 0
), "Zero or more than one correct answers."
```

E.g., for >1 correct answer, is this actually not possible for the real test and an error in the benchmark files, or could we support >1 correct answer questions too?
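To make that last question concrete, a minimal sketch (assuming `target_scores` maps each answer choice to 0 or 1, as the snippet above implies) that surfaces the filtered-out questions for inspection rather than asserting them away:

```python
# Sketch: report, rather than forbid, questions whose marked-answer count
# is not exactly 1. `dataset` is the filtered dataset from the snippet above.
no_answer = dataset.filter(lambda x: sum(x["target_scores"].values()) == 0)
multi_answer = dataset.filter(lambda x: sum(x["target_scores"].values()) > 1)
print(f"{len(no_answer)} questions with no marked answer")
print(f"{len(multi_answer)} questions with more than one marked answer")
```

If the benchmark files do turn out to contain legitimate multi-answer questions, the assertion could be relaxed to keep them instead.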
Most welcome. Thanks for your prompt feedback.
Thank you very much! However:

- I think the link you shared means that a model attempting to select more than one answer would be penalized, not that multiple answers can't both be counted.
- For computational cost reasons, I was thinking that a separate task variant which does greedy generation and only reports Maj@1 would be beneficial (sketched below). How long (and on what GPU) did it take to run Mistral on these tasks?
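As a sketch of what that variant could look like (the `include` target, task name, and generation defaults here are assumptions, not the PR's actual config):

```yaml
# Hypothetical greedy maj@1 variant of the task.
include: default_yaml          # shared settings from the base template
task: physics_gre_maj1
repeats: 1                     # one sample per question instead of eight
generation_kwargs:
  do_sample: false             # greedy decoding
  temperature: 0.0
```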
```diff
@@ -0,0 +1,52 @@
+dataset_kwargs: null # any extra keyword arguments that should be passed to the dataset constructor, e.g. `data_dir`.
```
This file should be renamed to `default_yaml` so that we don't try to register it as its own task!
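For reference, the registered task file would then pull in the shared settings via `include` (a sketch; the exact file and task names are assumptions):

```yaml
# physics_gre.yaml — registered because it ends in `.yaml`; the shared
# template, with no `.yaml` extension, is skipped during task discovery.
include: default_yaml
task: physics_gre
```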
Thanks a lot for getting back.
@haileyschoelkopf what is that based on? If you're referring to the line about the scoring function taking both correct and incorrect answers into account, you're misunderstanding.
To make things concrete: let the correct answer be
The evaluator will judge responses 1 and 4 as expected. But even though response 3 is wrong, the regex parsing/filtering will report its answer as
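A hypothetical illustration of that failure mode (the regex, answer format, and sample response below are assumptions, not the PR's actual filter):

```python
import re

# Naive extraction: take the first lettered choice found in the output.
def extract_answer(response: str):
    match = re.search(r"\(([A-E])\)", response)
    return match.group(1) if match else None

# A response that hedges across several options still yields one letter,
# so whichever choice appears first is the one that gets scored.
print(extract_answer("The answer is (B), though (C) also seems plausible."))  # -> B
```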
No, I am referring to the fact that the text from the link doesn't imply that, say, Question 5 might permit both an answer of solely "A" and an answer of solely "B" to be correct. So we should check whether any such questions are permitted by the test, since the link doesn't expressly confirm/deny this.
I don't think so. I think this is just saying that missing, malformed, and incorrect answers are all treated the same way. I don't read this as implying that some questions have multiple correct answers and that in such cases you should only answer with one of them.
This PR adds the Physics GRE dataset released in the Public Inflection Benchmark by @InflectionAI. Please refer here for the details.
It resolves issue #1554.