Different results between training and eval #40
Comments
Yes, I've encountered this problem. For this reason, I always report numbers that are reproducible from the saved checkpoints, never the ones logged during training.
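For illustration, here is a minimal sketch of re-running evaluation from a saved checkpoint with a generic HuggingFace-style seq2seq setup. The checkpoint path, the `predict_sql` helper, the generation settings, and `compute_exact_match` are assumptions for the example, not this repo's exact API:

```python
# Hedged sketch: evaluate a saved checkpoint independently of the training loop.
# Paths, the helper names, and the metric function are placeholders.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint_dir = "output/checkpoint-2304"  # hypothetical checkpoint path
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_dir).eval()

def predict_sql(question_with_schema: str) -> str:
    # Beam settings are illustrative; they should match the training-time eval config.
    inputs = tokenizer(question_with_schema, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_length=256, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# compute_exact_match(...) stands in for the repo's Spider metric code.
# predictions = [predict_sql(x) for x in dev_inputs]
# print(compute_exact_match(predictions, gold_queries))
```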
@eyuansu62 Something I noticed: are you aware that your exact match and exec accuracies are identical? That doesn't seem right; have you modified that code?
Another thought: the content matching code I borrowed from Victoria Lin et al.'s BRIDGE model does not necessarily produce the same column values between runs. This instability can partially, but not fully, explain the discrepancy. If you like to stare at diffs, try comparing the …
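If it helps, here is a minimal sketch of diffing two prediction files line by line to spot such run-to-run differences. The file names and the one-prediction-per-line format are assumptions:

```python
# Hedged sketch: diff two prediction files to find queries that changed between runs.
# File names are placeholders; one predicted SQL string per line is assumed.
import difflib

with open("predictions_run_a.sql") as fa, open("predictions_run_b.sql") as fb:
    run_a = fa.read().splitlines()
    run_b = fb.read().splitlines()

for i, (a, b) in enumerate(zip(run_a, run_b)):
    if a != b:
        # Print a compact unified diff for each query that differs between the runs.
        print(f"--- example {i} ---")
        print("\n".join(difflib.unified_diff([a], [b], lineterm="")))
```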
I did not modify the metric code, and the identical result at epoch 2304 seems to be a coincidence, because there is:
Recently, I carefully compared the differences between training and evaluation outputs. There are many kinds of errors, such as keyword errors (asc, desc), wrong table names, wrong column names, etc.
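A rough sketch of tallying such error categories between the two prediction sets is below. The file names and the very coarse keyword/table heuristics are assumptions for illustration, not the metric code's actual logic:

```python
# Hedged sketch: bucket mismatched predictions into coarse error categories.
# File names and the string-level heuristics are illustrative only.
from collections import Counter

def categorize(pred: str, gold: str) -> str:
    p, g = pred.lower().split(), gold.lower().split()
    # Order-keyword error: asc/desc present in one query but not the other.
    if ("asc" in p) != ("asc" in g) or ("desc" in p) != ("desc" in g):
        return "order keyword (asc/desc)"
    # Wrong table name: token after the first FROM differs (very coarse check).
    if "from" in p and "from" in g:
        tp, tg = p.index("from") + 1, g.index("from") + 1
        if tp < len(p) and tg < len(g) and p[tp] != g[tg]:
            return "wrong table name"
    return "other (e.g. wrong column name)"

with open("train_time_preds.sql") as f1, open("eval_time_preds.sql") as f2, open("gold.sql") as f3:
    errors = Counter(
        categorize(p_eval, gold)
        for p_train, p_eval, gold in zip(f1, f2, f3)
        if p_train.strip() != p_eval.strip()
    )
print(errors)
```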
Sorry to bother you, but I found another interesting problem.
When I start training (with train.json), I get intermediate results such as:
As can be seen, eval_exact_match is around 0.64.
But if I run evaluation mode (with eval.json), I get:
The eval_exact_match is around 0.62.
And the eval.json is:
The difference is about 2%. Have you ever seen this problem?