
Configure different training schemes #146

Closed
Eiriksak opened this issue May 13, 2021 · 4 comments

Comments

@Eiriksak

Hi,
I am currently trying to compare different training schemes when training the biencoder for a new downstream task. Can someone please clarify how to properly set up the data and config in order to run experiments similar to those in Table 3 of the paper:
[image: dpr_table3 — Table 3 of the DPR paper]

The code does already use in-batch negative training, as stated in #110. I wonder whether:

  1. The top-block experiments (not IB) can be configured in the current codebase, or whether I have to change the loss computation, e.g. by adding a slicing function as in #110 (How to use in-batch negative and gold when training?).
  2. Gold/Random/BM25 in the top block are manually created and added to the negative_ctxs/hard_negative_ctxs lists in the retriever training data, before setting hard_negatives/other_negatives = #N, while Gold in the middle block means hard_negatives/other_negatives = 0 and batch_size = #N+1, which creates #N in-batch negatives by itself.
  3. The bottom block means other_negatives = 0, hard_negatives = 1/2, and batch_size = 32/128 (giving 31/127 Gold in-batch negatives).

I don't know if it makes sense to run experiments with random negatives or pre-computed Gold negatives (adding them to negative_ctxs and setting other_negatives = #N) when the IB setting is on.
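
For reference, a minimal sketch of the in-batch NLL loss and a slice-based non-IB variant, the change discussed above. Names, shapes, and the positive-at-index-0 convention are assumptions for illustration, not the repo's actual loss code:

```python
import torch
import torch.nn.functional as F

def in_batch_nll(q_vectors, ctx_vectors, positive_idx_per_question):
    # q_vectors: (B, d); ctx_vectors: (B * (N + 1), d)
    # Every question is scored against every context in the batch, so the
    # other questions' positives act as "gold" in-batch negatives.
    scores = torch.matmul(q_vectors, ctx_vectors.t())  # (B, B * (N + 1))
    log_probs = F.log_softmax(scores, dim=1)
    targets = torch.as_tensor(positive_idx_per_question, device=log_probs.device)
    return F.nll_loss(log_probs, targets)

def non_ib_nll(q_vectors, ctx_vectors, n_ctxs_per_question):
    # Top-block (non-IB) variant: slice the contexts so question i is scored
    # only against its own positive + its own #N negatives.
    B, d = q_vectors.shape
    ctx = ctx_vectors.view(B, n_ctxs_per_question, d)    # (B, N + 1, d)
    scores = torch.einsum("bd,bnd->bn", q_vectors, ctx)  # (B, N + 1)
    log_probs = F.log_softmax(scores, dim=1)
    # assumes each question's positive sits at index 0 of its slice
    targets = torch.zeros(B, dtype=torch.long, device=log_probs.device)
    return F.nll_loss(log_probs, targets)
```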

@Eiriksak (Author)

It would, for instance, be interesting to do curriculum learning that starts with random negatives, then hard negatives from BM25, then hard negatives from the previous checkpoint.
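
A hypothetical scheduler for such a curriculum could be as simple as the following; the stage boundaries and source names are made up for illustration and are not part of the repo:

```python
def negative_source(epoch: int) -> str:
    # Pick the source of hard negatives by training stage.
    if epoch < 10:
        return "random"            # easy lexical negatives first
    if epoch < 25:
        return "bm25"              # lexically close hard negatives
    return "previous_checkpoint"   # negatives mined with the last model
```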

@vlad-karpukhin (Contributor)

Hi @Eiriksak ,
Unfortunately, we provide only the in-batch negative training scheme in this code repository.
You will need to modify the data and code if you want to conduct all the experiments from the table above.
As you can see in #110, robinsongh381 might have already made some useful changes.

As for your list of questions:

  1. You will need to modify the loss computation code, but it is mostly just a matter of commenting out some lines.
  2. This again depends on the implementation of point 1 above (a non-IB loss computation). You will then need to create data for all three options or modify the existing data format. We don't have this code/data anymore.
  3. The last line means we used a batch of 16 questions per GPU on an 8-GPU server. Overall, this global batch contains 8 × 16 = 128 questions, 128 positives, 128 hard negatives, and 128 regular (gold) negatives (127 per question); the arithmetic is spelled out below.
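
Spelling out that arithmetic (the variable names here are illustrative, not config keys from the repo):

```python
gpus = 8
questions_per_gpu = 16
questions = gpus * questions_per_gpu                # 128 questions in the global batch
positives = questions                               # one positive per question -> 128
hard_negatives = questions                          # one hard negative per question -> 128
gold_negatives_per_question = questions - 1         # the other questions' positives -> 127
contexts_per_question = positives + hard_negatives  # 256 contexts scored per question
```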

My feeling is that random negatives are mostly useless: the model learns lexical matching pretty quickly, and the challenge then becomes distinguishing between semantically different but lexically close passages. Random negatives are not very useful for that purpose.

@Eiriksak (Author)

Thanks for a great reply @vlad-karpukhin!

I guess I still have to include random negatives and hard negatives in the dev set when validate_average_rank runs:

def validate_average_rank(self) -> float:

I can see the default is to start this from epoch 30 (val_av_rank_start_epoch) and to include 30 hard negatives and 30 other negatives per question (val_av_rank_hard_neg, val_av_rank_other_neg). Is there any reason why you evaluate with validate_nll for the first 30 epochs instead of this?
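
For anyone following along, a rough sketch of what an average-rank validation computes; the shapes and names are assumptions, not the repo's actual validate_average_rank:

```python
import torch

def average_positive_rank(q_vectors, ctx_vectors, positive_idx):
    # q_vectors: (Q, d); ctx_vectors: (P, d) pooled dev contexts
    # positive_idx: (Q,) index of each question's positive within the pool
    scores = q_vectors @ ctx_vectors.t()  # (Q, P)
    pos_scores = scores[torch.arange(scores.size(0)), positive_idx]
    # rank = number of contexts that score higher than the positive
    ranks = (scores > pos_scores.unsqueeze(1)).sum(dim=1)
    return ranks.float().mean().item()
```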

@vlad-karpukhin (Contributor)

Both NLL and average rank are imperfect validation metrics when we measure their correlation with the final retrieval performance over the entire Wikipedia.
NLL quickly saturates and is useful at early training stages to measure training dynamics. Its value then stabilizes at some level while, in fact, the model keeps improving (if you do full evaluation).
Average rank is more expensive to calculate, but it is more sensitive and correlates better with the final model performance.
You can enable it much earlier, or use only NLL; it is not critical, and there is no strong logic behind it.
My general recommendation is to always do a full evaluation for the last checkpoint and for the one selected by the average-rank metric.
