Feature/deprecate binary cross entropy loss #203
Conversation
This is looking great! Thanks for this. Well written PR, too!
I reran your experiments, and can corroborate the results. It all looks above board, too. Very interesting to see such large differences compared to the LogisticRegression head. Perhaps this is caused by a class imbalance or something similar?
I did some further tests with other datasets, i.e. non-binary ones, and it still works as intended.
I'll put #187 on hold until this is merged.
Thanks for your feedback, @tomaarsen! That really helps validate the experiments! I checked the test dataset of Amazon-CF and it matched your guess: the proportion is 10% (503 samples) for 1 vs. 90% (5497 samples) for 0. Now I'm wondering why I didn't get similar results in this PR earlier. Maybe something changed? Will look into it.
@blakechi I had gotten a hunch about the results and the class imbalance. No reputable researchers would evaluate a classification task with a 90:10 class imbalance using accuracy, and most certainly no reasonable data scientist would then only score ~43% on the task, hah! (See lines 27 to 34 in fa1021d.)
I've re-run the experiments from the paper for the Amazon-CF dataset, with the following change to `scripts/setfit/run_fewshot.py`:

```diff
diff --git a/scripts/setfit/run_fewshot.py b/scripts/setfit/run_fewshot.py
index 088e25d..ae03992 100644
--- a/scripts/setfit/run_fewshot.py
+++ b/scripts/setfit/run_fewshot.py
@@ -91,7 +91,7 @@ def main():
     elif args.is_test_set:
         dataset_to_metric = TEST_DATASET_TO_METRIC
     else:
-        dataset_to_metric = {dataset: "accuracy" for dataset in args.datasets}
+        dataset_to_metric = {dataset: "matthews_correlation" for dataset in args.datasets}
     # Configure loss function
     loss_class = LOSS_NAME_TO_CLASS[args.loss]
```

This has resulted in these outputs:
This is much more along the lines of what we would expect. In other words, this PR seems to work as intended, without any fun additional surprises.
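A minimal sketch (mine, not from this PR) of why accuracy is misleading here: with a roughly 90:10 split like Amazon-CF's test set, a degenerate predictor that always outputs the majority class scores ~92% accuracy but a Matthews correlation of 0.

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Roughly Amazon-CF's test split: 5497 negatives, 503 positives
y_true = [0] * 5497 + [1] * 503
# Degenerate predictor that always outputs the majority class
y_pred = [0] * 6000

print(accuracy_score(y_true, y_pred))     # ~0.916 -- looks deceptively good
print(matthews_corrcoef(y_true, y_pred))  # 0.0 -- no predictive power at all
```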
Great finding! Ya, that explains why the numbers are so different. Thanks for the extra experiments as well!
Putting this PR on hold since #207 might make some changes that overlap.
I think this ought to be ready for merging now. I resolved some merge conflicts and verified that training after this PR works equivalently to the main branch. I ran several experiments using sst2 and sst5 with success, and clearly the tests pass, too.
This PR is opened to resolve an issue found by @tomaarsen. Thanks @tomaarsen, and nice PR #187, I used it as the template!
Pull request overview
Major changes
- Deprecate `torch.nn.BCELoss` for binary classification and replace it with `torch.nn.CrossEntropyLoss`, treating the binary case as 2-class classification (see the sketch in the Solution section below).
- Update the corresponding tests in `test_modeling.py`.
Minor changes
- Add `eps` for numerical stability when scaling `logits` with `temperature` (see the sketch after this list).
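A minimal sketch of the `eps` guard described above; the exact constant and where it sits in the head are assumptions, not copied from the diff:

```python
import torch

eps = 1e-8  # assumed value; guards against division by a collapsed temperature
temperature = torch.tensor(0.0)  # worst case: temperature shrinks to zero
logits = torch.tensor([[2.0, -1.0]])

# Without eps this division would produce inf/nan; with it, scaling stays finite
scaled_logits = logits / (temperature + eps)
```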
Details
Problem
As @tomaarsen mentioned here: `BCELoss` and `CrossEntropyLoss` require different target data types, which adds the complexity of casting the labels into the correct data type and makes the code harder to read (illustrated below).
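For illustration, a minimal example (shapes and values are mine) of the dtype mismatch: `BCELoss` expects per-sample probabilities and float targets, while `CrossEntropyLoss` expects raw logits and integer class indices.

```python
import torch
from torch import nn

labels = torch.tensor([1, 0, 1])

# BCELoss path: per-sample probabilities and *float* targets
probs = torch.tensor([0.9, 0.2, 0.7])  # sigmoid outputs, shape (N,)
bce = nn.BCELoss()(probs, labels.float())

# CrossEntropyLoss path: (N, C) logits and *long* targets
logits = torch.tensor([[0.1, 2.1], [1.5, 0.2], [0.3, 1.0]])
ce = nn.CrossEntropyLoss()(logits, labels.long())
```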
Solution
Therefore, this PR deprecates `BCELoss` and uses `CrossEntropyLoss` for binary classification.
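A sketch of the unified path after this change (shapes assumed, not the actual head code): the head emits two logits per example even for binary tasks, so a single `CrossEntropyLoss` covers binary and multiclass alike, and labels stay integer class indices throughout with no casting.

```python
import torch
from torch import nn

loss_fn = nn.CrossEntropyLoss()

# Binary task treated as 2-class classification: logits of shape (N, 2)
binary_logits = torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])  # plain class indices, no casting needed

loss = loss_fn(binary_logits, labels)
```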
Result
I ran some experiments on the `CR` and `EnronSpam` datasets to check whether the performance changed. Here are the results with N=8, as in Table 2 of the paper:
(Hyperparameters: batch size = 16, L2 weight (weight decay) = 0, head learning rate = 1e-2, body kept frozen)
From the results, the performance is similar except on Amazon-CF. It looked strange to me, and after several runs I still got similar results. Here is the notebook I used to run the experiments for `pytorch (CrossEntropy)`.
Conclusion
The advantages of this change (copied from here):
- Makes moving logic from `SetFitTrainer` to a dedicated class (#179) easier.

Still need to validate whether the results on `Amazon-CF` are correct. @lewtun, could you check the notebook to validate the experiments? Thanks!