
Configure different training schemes #146

Closed
Eiriksak opened this issue May 13, 2021 · 4 comments

Comments

@Eiriksak

Hi,
I am currently trying to compare different training schemes when training the biencoder for a new downstream task. Can someone please clarify how to properly set up the data and config in order to run experiments similar to those in Table 3 of the paper:
[image: dpr_table3 — Table 3 of the DPR paper]

The code does already use in-batch negative training, as stated in #110. I wonder whether:

  1. The top-block experiments (not IB) can be configured in the current codebase, or whether I have to change the loss computation, e.g. by adding a slicing function as in #110 (How to use in-batch negative and gold when training?).
  2. Gold/Random/BM25 in the top block are manually created and added to the negative_ctxs/hard_negative_ctxs lists in the retriever training data, before setting hard_negatives/other_negatives = #N, while Gold in the middle block means hard_negatives/other_negatives = 0 and batch_size = #N+1, which creates #N in-batch negatives by itself.
  3. The bottom block means other_negatives = 0, hard_negatives = 1/2, and batch_size = 32/128 (giving 31/127 Gold in-batch negatives).

I don't know if it makes sense to run experiments with random negatives or pre-computed Gold negatives (adding them to negative_ctxs and setting other_negatives = #N) when the IB setting is on.
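
For reference, a minimal sketch of the in-batch NLL loss and a slice-based non-IB variant, the change discussed above. Names, shapes, and the positive-at-index-0 convention are assumptions for illustration, not the repo's actual loss code:

```python
import torch
import torch.nn.functional as F

def in_batch_nll(q_vectors, ctx_vectors, positive_idx_per_question):
    # q_vectors: (B, d); ctx_vectors: (B * (N + 1), d)
    # Every question is scored against every context in the batch, so the
    # other questions' positives act as "gold" in-batch negatives.
    scores = torch.matmul(q_vectors, ctx_vectors.t())  # (B, B * (N + 1))
    log_probs = F.log_softmax(scores, dim=1)
    targets = torch.as_tensor(positive_idx_per_question, device=log_probs.device)
    return F.nll_loss(log_probs, targets)

def non_ib_nll(q_vectors, ctx_vectors, n_ctxs_per_question):
    # Top-block (non-IB) variant: slice the contexts so question i is scored
    # only against its own positive + its own #N negatives.
    B, d = q_vectors.shape
    ctx = ctx_vectors.view(B, n_ctxs_per_question, d)    # (B, N + 1, d)
    scores = torch.einsum("bd,bnd->bn", q_vectors, ctx)  # (B, N + 1)
    log_probs = F.log_softmax(scores, dim=1)
    # assumes each question's positive sits at index 0 of its slice
    targets = torch.zeros(B, dtype=torch.long, device=log_probs.device)
    return F.nll_loss(log_probs, targets)
```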

@Eiriksak (Author)

It would, for instance, be interesting to do curriculum learning that starts with random negatives, then hard negatives from BM25, then hard negatives from the previous checkpoint.
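
A hypothetical scheduler for such a curriculum could be as simple as the following; the stage boundaries and source names are made up for illustration and are not part of the repo:

```python
def negative_source(epoch: int) -> str:
    # Pick the source of hard negatives by training stage.
    if epoch < 10:
        return "random"            # easy lexical negatives first
    if epoch < 25:
        return "bm25"              # lexically close hard negatives
    return "previous_checkpoint"   # negatives mined with the last model
```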

@vlad-karpukhin (Contributor)

Hi @Eiriksak ,
Unfortunately, we provide only the in-batch negative training scheme in this code repository.
You will need to modify the data and code if you want to conduct all the experiments from the table above.
As you can see in #110, robinsongh381 might have already made some useful changes.

As for your list of questions:

  1. You will need to modify the loss computation code, but it is mostly just a matter of commenting out some lines.
  2. This again depends on the implementation of point 1 above (a non-IB loss computation). You will then need to create data for all three options or modify the existing data format. We don't have this code/data anymore.
  3. The last line means we used a batch of 16 questions per GPU on an 8-GPU server. Overall, this global batch contains 8 × 16 = 128 questions, 128 positives, 128 hard negatives, and 128 regular (gold) negatives (127 per question); the arithmetic is spelled out below.
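
Spelling out that arithmetic (the variable names here are illustrative, not config keys from the repo):

```python
gpus = 8
questions_per_gpu = 16
questions = gpus * questions_per_gpu                # 128 questions in the global batch
positives = questions                               # one positive per question -> 128
hard_negatives = questions                          # one hard negative per question -> 128
gold_negatives_per_question = questions - 1         # the other questions' positives -> 127
contexts_per_question = positives + hard_negatives  # 256 contexts scored per question
```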

My feeling is that random negatives are mostly useless: the model learns lexical matching pretty quickly, and the challenge then becomes distinguishing between semantically different but lexically close passages. Random negatives are not very useful for that purpose.

@Eiriksak (Author)

Thanks for a great reply @vlad-karpukhin!

I guess I still have to include random negatives and hard negatives in the dev set when validate_average_rank runs:

def validate_average_rank(self) -> float:

I can see the default is to start this from epoch 30 (val_av_rank_start_epoch) and to include 30 hard negatives and 30 other negatives per question (val_av_rank_hard_neg, val_av_rank_other_neg). Is there any reason why you evaluate with validate_nll for the first 30 epochs instead of this?
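
For anyone following along, a rough sketch of what an average-rank validation computes; the shapes and names are assumptions, not the repo's actual validate_average_rank:

```python
import torch

def average_positive_rank(q_vectors, ctx_vectors, positive_idx):
    # q_vectors: (Q, d); ctx_vectors: (P, d) pooled dev contexts
    # positive_idx: (Q,) index of each question's positive within the pool
    scores = q_vectors @ ctx_vectors.t()  # (Q, P)
    pos_scores = scores[torch.arange(scores.size(0)), positive_idx]
    # rank = number of contexts that score higher than the positive
    ranks = (scores > pos_scores.unsqueeze(1)).sum(dim=1)
    return ranks.float().mean().item()
```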

@vlad-karpukhin (Contributor)

Both NLL and average rank are imperfect validation metrics when we measure their correlation with the final retrieval performance over the entire Wikipedia.
NLL quickly saturates and is useful at early training stages to measure training dynamics. Its value then stabilizes at some level while, in fact, the model keeps improving (if you do full evaluation).
Average rank is more expensive to calculate, but it is more sensitive and correlates better with the final model performance.
You can enable it much earlier, or use only NLL; it is not critical, and there is no strong logic behind it.
My general recommendation is to always do a full evaluation for the last checkpoint and for the one selected by the average-rank metric.
