
Different results between training and eval #40

Open · eyuansu62 opened this issue Dec 23, 2021 · 5 comments

@eyuansu62 commented Dec 23, 2021

Sorry to bother you, but I've found another interesting problem.
When I train (with train.json), I get intermediate results during training such as:

      "epoch": 2304.0,
      "eval_exact_match": 0.6460348162475822,
      "eval_exec": 0.6460348162475822,
      "eval_loss": 0.41825902462005615,
      "eval_runtime": 90.718,
      "eval_samples_per_second": 11.398,
      "step": 2304

It can be seen that eval_exact_match is around 0.64.

But if I run evaluation mode (with eval.json) on the same checkpoint, I get:

    "eval_exact_match": 0.6247582205029013,
    "eval_exec": 0.6431334622823984,
    "eval_loss": 0.41071268916130066,
    "eval_runtime": 244.047,
    "eval_samples": 1034,
    "eval_samples_per_second": 4.237

The eval_exact_match is around 0.62.
And the eval.json is:

    "run_name": "t5+picard-spider-eval",
    "model_name_or_path": "train/checkpoint-2304",
    "dataset": "spider",
    "source_prefix": "",
    "schema_serialization_type": "peteshaw",
    "schema_serialization_randomized": false,
    "schema_serialization_with_db_id": true,
    "schema_serialization_with_db_content": true,
    "normalize_query": true,
    "target_with_db_id": true,
    "output_dir": "/eval",
    "cache_dir": "/transformers_cache",
    "do_train": false,
    "do_eval": true,
    "fp16": false,
    "per_device_eval_batch_size": 5,
    "seed": 1,
    "report_to": ["tensorboard"],
    "predict_with_generate": true,
    "num_beams": 4,
    "num_beam_groups": 1,
    "diversity_penalty": 0.0,
    "max_val_samples": 1034,
    "use_picard": false,
    "launch_picard": false,
    "picard_mode": "parse_with_guards",
    "picard_schedule": "incremental",
    "picard_max_tokens_to_check": 2,
    "eval_accumulation_steps": 1,
    "metric_config": "both",
    "val_max_target_length": 512,
    "val_max_time": 1200

The difference is about 2%. Have you ever seen this problem?

@tscholak (Collaborator)
Yes, I've encountered this problem. For this reason I always report the numbers that are reproducible from the saved checkpoints, and never the numbers logged during training.
I have been unable to pinpoint the origin of the issue, though I think it has to do with mixed-precision training and lossy conversions between floating-point formats when saving the model weights. If I knew how to reproduce this in a minimal example, I'd open an issue with hf transformers.
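
A minimal sketch of that hypothesis, assuming the checkpoint goes through an fp32 → fp16 → fp32 round trip on save/load (this illustrates the suspected mechanism only, not a confirmed root cause):

    # Hypothesis sketch: measure the weight drift from an fp32 -> fp16 -> fp32
    # round trip, as can happen when mixed-precision weights are saved in half
    # precision and reloaded in full precision. Shapes here are arbitrary.
    import torch

    torch.manual_seed(0)
    w_fp32 = torch.randn(1000, 1000)      # stand-in for fp32 master weights
    w_roundtrip = w_fp32.half().float()   # "save" as fp16, "load" as fp32

    drift = (w_fp32 - w_roundtrip).abs().max().item()
    print(f"max per-weight drift after round trip: {drift:.2e}")
    # Even small per-weight drift can flip borderline decisions during beam
    # search, which could plausibly move exact match by a couple of points.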

tscholak added the bug label Dec 23, 2021
@tscholak (Collaborator)
@eyuansu62 Something I noticed: are you aware that your exact match and exec accuracies are identical? That doesn't seem right; have you made modifications to that code?

@tscholak (Collaborator)
Another thought: the content-matching code I borrowed from Victoria Lin et al.'s BRIDGE model does not necessarily produce the same column values between runs. This instability may explain part of the discrepancy, but not all of it. If you'd like to stare at diffs, try comparing the predictions_[step].json files between training and evaluation.
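
A small script along these lines can surface the mismatches; the directory layout and the "prediction" key are assumptions about the dump format, so adjust them to whatever your runs actually write:

    # Compare two prediction dumps entry by entry and print disagreements.
    # Paths and the "prediction" key are assumed, not taken from the repo.
    import json

    with open("train/predictions_2304.json") as f:
        train_preds = json.load(f)
    with open("eval/predictions_2304.json") as f:
        eval_preds = json.load(f)

    for i, (a, b) in enumerate(zip(train_preds, eval_preds)):
        if a["prediction"] != b["prediction"]:
            print(f"example {i}:")
            print(f"  train: {a['prediction']}")
            print(f"  eval:  {b['prediction']}")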

@eyuansu62 (Author) commented Dec 24, 2021

> Something I noticed: are you aware that your exact match and exec accuracies are identical? That doesn't seem right; have you made modifications to that code?

I haven't modified the metric code. The identical values at step 2304 seem to be a coincidence, because at step 3008 I get:

      "epoch": 3008.0,
      "eval_exact_match": 0.6450676982591876,
      "eval_exec": 0.6421663442940039,
      "eval_loss": 0.45334360003471375,
      "eval_runtime": 96.9869,
      "eval_samples_per_second": 10.661,
      "step": 3008

@eyuansu62 (Author)

> content-matching code

Recently, I carefully compared the predictions from training and evaluation. There are many kinds of errors: keyword errors (asc vs. desc), wrong table names, wrong column names, etc.
Because I focus on exact match, the column values seem unimportant to me.
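
For illustration only, a crude way to bucket those mismatch kinds might look like the sketch below; the real Spider exact-match metric parses and normalizes the SQL, whereas this just compares token sets:

    # Illustrative only: bucket a prediction/gold mismatch into the rough
    # categories mentioned above. Not the actual Spider exact-match logic.
    def rough_error_bucket(pred: str, gold: str) -> str:
        p, g = set(pred.lower().split()), set(gold.lower().split())
        diff = p ^ g  # tokens appearing in only one of the two queries
        if diff & {"asc", "desc"}:
            return "order-by keyword (asc/desc)"
        return "table/column name or other token mismatch"

    print(rough_error_bucket(
        "select name from singer order by age asc",
        "select name from singer order by age desc",
    ))  # -> order-by keyword (asc/desc)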
