Question about padding the input sequence #294

Open
BaleChen opened this issue Jul 20, 2023 · 3 comments

@BaleChen

stanford_alpaca/train.py, lines 90 to 99 (commit 761dc5b):

tokenized_list = [
    tokenizer(
        text,
        return_tensors="pt",
        padding="longest",
        max_length=tokenizer.model_max_length,
        truncation=True,
    )
    for text in strings
]

In this snippet, from what I understand, no padding is actually added: since the tokenizer is called on one string at a time, "longest" mode is equivalent to adding no padding, as per this doc. Is that right? So the padding for each prompt is added by the data collator instead of here.

I wonder if it would be clearer to just write padding=False here, or to add a comment about it.
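
For what it's worth, here is a minimal sketch (not from the repo; "gpt2" is only a stand-in checkpoint) showing that padding="longest" on a single string behaves the same as padding=False:

from transformers import AutoTokenizer

# Stand-in tokenizer; LLaMA-style tokenizers likewise have no pad token by default.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

text = "Below is an instruction that describes a task."
longest = tokenizer(text, return_tensors="pt", padding="longest")
no_pad = tokenizer(text, return_tensors="pt", padding=False)

# With a single sequence, "longest" pads to the longest item in a batch of one,
# so no pad tokens are added and both calls produce identical input_ids.
assert longest["input_ids"].tolist() == no_pad["input_ids"].tolist()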

@srhthu commented Sep 1, 2023

I think so. Actually, they use dynamic padding via the DataCollatorForSupervisedDataset. My concern is whether the padding tokens should be on the left rather than the right. The other repo, https://github.com/tloen/alpaca-lora, pads on the left, which makes sense for batch training.
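
For reference, the dynamic padding in the collator is roughly the following (a paraphrase of what DataCollatorForSupervisedDataset does, not the exact code from train.py); pad_sequence pads on the right:

import torch
from torch.nn.utils.rnn import pad_sequence

IGNORE_INDEX = -100  # label value ignored by the loss

def collate(instances, pad_token_id):
    # instances: list of dicts holding variable-length 1-D LongTensors
    input_ids = [ins["input_ids"] for ins in instances]
    labels = [ins["labels"] for ins in instances]
    input_ids = pad_sequence(input_ids, batch_first=True, padding_value=pad_token_id)
    labels = pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)
    return dict(
        input_ids=input_ids,
        labels=labels,
        attention_mask=input_ids.ne(pad_token_id),  # False at pad positions
    )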

@maksimstw commented Sep 23, 2023

Agree with @srhthu. I think left padding makes more sense, but train.py uses right padding instead. I think the code they use to train Alpaca is simply not correct for batch training. See the explanation here.
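
For context, a rough sketch of the left-padded batched-generation setup being argued for here ("gpt2" is only a stand-in checkpoint):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # pads go before the prompt, not after it

model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["Short prompt.", "A much longer prompt that needs no padding at all."]
batch = tokenizer(prompts, return_tensors="pt", padding=True)

# With left padding, every prompt ends at the last position, so generate()
# continues from the prompt itself rather than from trailing pad tokens.
with torch.no_grad():
    out = model.generate(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        max_new_tokens=20,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.batch_decode(out, skip_special_tokens=True))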

@BaleChen (Author)

Hey @maksimstw,

My previous understanding is that batch inference with decoder-only models requires left padding, but at the fine-tuning stage right-side padding is okay as long as we set the attention mask correctly and set the pad-token labels to -100 when calculating the loss.
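
For concreteness, a toy illustration of that training-time setup (illustrative tensors, not from the repo):

import torch

pad_id = 0
input_ids = torch.tensor([
    [5, 6, 7, 8],            # full-length example
    [5, 6, pad_id, pad_id],  # right-padded example
])
attention_mask = input_ids.ne(pad_id).long()  # 0 at pad positions

labels = input_ids.clone()
labels[labels == pad_id] = -100  # ignored by CrossEntropyLoss(ignore_index=-100)

# A causal LM would then be called as:
# outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
# Only non-pad positions contribute to the loss, which is why right padding works
# for fine-tuning; left padding only becomes necessary for batched generation.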

Is it the case that we can simply use left padding for both training and inference in generation tasks?
