Albert fine tuning does not always converge #831
Comments
If anyone is feeling motivated to check this out, feel free! But heads up, it may be a little tedious to debug with just a Colab GPU.
It might be a better option to use a P100 on Kaggle. Also, we could reduce the sequence length to 128 to enable larger batch sizes?
Yep, takes 4 minutes/epoch on Kaggle with batch size 32 and sequence length 128!
@mattdangerw, it seems to give consistently good results with LR = 2e-5 and Adam.

Edit: Oh, right. Out of 5 runs, it fails in one run (gives an accuracy of 50%). Hmmm.
@mattdangerw It's not uncommon for machine learning models to show varying levels of performance on different runs, especially when dealing with small datasets or complex models. In the case of the script you provided, the issue may be related to the initialization of the model's parameters or the optimization algorithm used during training.

As you suggested, one way to tackle this problem is to experiment with different optimization algorithms and learning rates. AdamW, for example, is a variant of the Adam optimizer that can help mitigate the effects of weight decay and improve generalization. You could also try using a different learning rate schedule, such as a cosine annealing schedule, to help the model converge to a good solution.

Another approach would be to use a different initialization strategy for the model's parameters. The ALBERT model uses a unique parameter-sharing strategy that involves decomposing the embedding and transformer layers into shared and unshared subspaces. This may require a different initialization strategy than traditional models. You could try initializing the model's parameters with a different distribution, such as a truncated normal distribution, and see if it improves performance.

Finally, you mentioned running training multiple times with a freshly initialized model. This is a good idea to help ensure that any improvements in performance are not just due to chance. You may also want to consider using a technique such as k-fold cross-validation to get a better estimate of the model's performance.

Overall, the key to improving the performance of the ALBERT model on this task will be to experiment with different hyperparameters and initialization strategies and to carefully monitor the model's performance over multiple runs.

Code:

```python
# Importing Libraries.
import keras_nlp

# Load IMDb movie reviews dataset.
imdb_train, imdb_test = tfds.load(...)

# Define ALBERT model with custom initialization and optimizer.
initializer = tf.keras.initializers.TruncatedNormal(stddev=0.02)

# Define learning rate schedule.
num_train_examples = len(imdb_train)

# Compile model.
classifier.compile(...)

# Train model.
history = classifier.fit(...)

# Evaluate model.
test_loss, test_acc = classifier.evaluate(imdb_test)
```

@abheesht17 Is it right?
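For reference, here is a minimal sketch of the AdamW-plus-cosine-decay setup suggested above, assuming the KerasNLP `AlbertClassifier` task API, the `albert_base_en_uncased` preset, and a TensorFlow release that ships `tf.keras.optimizers.AdamW`. The epoch count, batch size, and weight decay are illustrative placeholders, not a tested recipe.

```python
import tensorflow as tf
import tensorflow_datasets as tfds
import keras_nlp

# IMDb as raw (text, label) pairs; the classifier's built-in preprocessor
# tokenizes the strings on the fly.
imdb_train, imdb_test = tfds.load(
    "imdb_reviews", split=["train", "test"], as_supervised=True, batch_size=32
)

EPOCHS = 3
total_steps = int(imdb_train.cardinality().numpy()) * EPOCHS

# Cosine decay from a small peak learning rate down to zero over training.
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=2e-5, decay_steps=total_steps
)

# ALBERT classifier with pretrained backbone weights and a fresh head.
classifier = keras_nlp.models.AlbertClassifier.from_preset(
    "albert_base_en_uncased", num_classes=2
)

# Override the default compilation with AdamW and the schedule above.
classifier.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.AdamW(
        learning_rate=lr_schedule, weight_decay=0.01
    ),
    metrics=["accuracy"],
)

classifier.fit(imdb_train, validation_data=imdb_test, epochs=EPOCHS)
```

Because individual fine-tuning runs can still diverge by chance, only comparing several such runs per configuration really tells you whether the optimizer change helps.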
Hey @mattdangerw and @chenmoneygithub, the issue pertains to AlbertMaskedLM models as well. I've been playing around with it to get convergence as part of #833. Here is how ALBERT performs against four different LRs when training on the IMDB dataset; the full training script can be found here: https://www.kaggle.com/code/shivanshuman/does-tensorflow-task-converge
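A rough sketch of what such a learning-rate sweep for the masked LM task might look like, assuming the KerasNLP `AlbertMaskedLM` task follows the same pattern as the library's other masked LM tasks; the preset name, batch size, and the four learning rates are placeholders, not the values from the linked notebook.

```python
import tensorflow as tf
import tensorflow_datasets as tfds
import keras_nlp

# Raw review text only; the masked LM preprocessor generates the masked
# inputs and labels itself.
imdb_train = tfds.load("imdb_reviews", split="train", as_supervised=True)
texts = imdb_train.map(lambda text, label: text).batch(32)

for lr in [1e-3, 1e-4, 2e-5, 1e-5]:
    # Fresh pretrained backbone and masked LM head for each learning rate.
    masked_lm = keras_nlp.models.AlbertMaskedLM.from_preset("albert_base_en_uncased")
    masked_lm.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
        weighted_metrics=["sparse_categorical_accuracy"],
    )
    history = masked_lm.fit(texts, epochs=1, verbose=2)
    print(lr, history.history)
```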
@shivance, did you try AdamW and some form of LR scheduler?
@shivance thanks, that is helpful! Though I think the annoying part of this problem is that the failure state is random across entire training runs. So the analysis we are really going to need would include, say, 10 trials per optimizer/learning-rate approach. It's definitely going to be compute intensive to dig into this!

More just general musings on this: the instability of fine-tuning is a somewhat well-known problem for all of these models. When most papers report a GLUE score, they are really taking the top score out of, say, 5 trials per individual GLUE task. Here's a whole paper on the problem -> https://arxiv.org/pdf/2006.04884.pdf. There are some proposed solutions in there we can look at, but I haven't dug into them too deeply.

Note that our goal should not be to remove all instability in fine-tuning (that's probably not feasible), but to provide a better default starting place than most users would find on their own. At the end of the day, if you really care about fine-tuning on a specific dataset, nothing will beat a hyperparameter search for that specific dataset.
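To make that concrete, here is a rough sketch of a multi-trial comparison under the same API assumptions as the snippets above; the trial count, learning rates, epoch count, and the 60% failure threshold are arbitrary placeholders.

```python
import tensorflow as tf
import tensorflow_datasets as tfds
import keras_nlp

imdb_train, imdb_test = tfds.load(
    "imdb_reviews", split=["train", "test"], as_supervised=True, batch_size=32
)

results = {}
for lr in [5e-5, 2e-5, 1e-5]:
    accs = []
    for trial in range(10):
        # Fresh pretrained weights and a fresh (randomly initialized)
        # classification head for every trial.
        classifier = keras_nlp.models.AlbertClassifier.from_preset(
            "albert_base_en_uncased", num_classes=2
        )
        classifier.compile(
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
            metrics=["accuracy"],
        )
        classifier.fit(imdb_train, epochs=1, verbose=0)
        _, acc = classifier.evaluate(imdb_test, verbose=0)
        accs.append(acc)
    # A run stuck near 50% accuracy counts as a failed trial.
    results[lr] = (min(accs), max(accs), sum(a < 0.6 for a in accs))

for lr, (lo, hi, failures) in results.items():
    print(f"lr={lr}: min={lo:.3f} max={hi:.3f} failures={failures}/10")
```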
Looks like lowering the default learning rate a bit can fix most of this. Doing that now.
It appears that ALBERT classification sometimes fails catastrophically with our compilation defaults. On a standard Colab GPU, the following script will sometimes give good results and sometimes hover at 50% accuracy.

This may be something we face with all our models to some extent, but the issue seems exacerbated on ALBERT in particular. We should experiment with different optimizers (e.g. AdamW) and lower learning rates. We will need to run training multiple times with a fresh, randomly initialized model to see whether we are getting improvement on this problem.
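The original script is not captured above; a minimal sketch of an equivalent run with the library's default compilation, assuming the KerasNLP `AlbertClassifier` API (the preset name and batch size are illustrative), might look like this:

```python
import tensorflow_datasets as tfds
import keras_nlp

# IMDb reviews as raw (text, label) pairs; the classifier's preprocessor
# handles tokenization.
imdb_train, imdb_test = tfds.load(
    "imdb_reviews", split=["train", "test"], as_supervised=True, batch_size=16
)

# ALBERT classifier using the library's default compilation
# (optimizer, loss, and learning rate as shipped).
classifier = keras_nlp.models.AlbertClassifier.from_preset(
    "albert_base_en_uncased", num_classes=2
)

# With the defaults, accuracy sometimes climbs normally and sometimes
# stays pinned near 50% for the whole run.
classifier.fit(imdb_train, validation_data=imdb_test, epochs=1)
```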