Packing without cross-contamination #25452

Closed
ToddMorrill opened this issue Aug 11, 2023 · 3 comments

Comments

@ToddMorrill

Feature request

Is there something within Hugging Face that prevents later subsequences from attending to earlier subsequences when you use packing? Is there a way to implement attention masking so that each subsequence only attends to tokens within its own subsequence in a packed example?

As it currently stands:

  1. the attention mask fed into transformers is a 1D sequence, and we would need to be able to pass a 2D mask to specify the appropriate attention pattern when multiple sequences are packed together
  2. this will interact with positional embeddings, because positions should be relative to the start of each example, not to the sequence it's packed into
  3. this will impact the loss calculation at the boundaries of examples. In particular, EOS tokens shouldn't incur a loss for predicting the start of the next example.
  4. and there may be other impacts I'm not thinking of.

There appear to be a few challenges to overcome, but it nevertheless seems like an important feature to have. A rough sketch of what the packed inputs could look like is below.
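To make the pieces above concrete, here is a hypothetical sketch (current transformers models only accept a 1D attention mask, so this is not something you can pass to a model today): a block-diagonal causal attention mask, `position_ids` that restart at each example, and labels masked with `-100` at example boundaries. The `pack_examples` helper and its exact conventions are made up for illustration.

```python
# Hypothetical sketch only: shows what packed inputs without cross-contamination
# could look like if 2D attention masks were supported.
import torch

def pack_examples(examples, pad_token_id=0, max_len=16):
    """Pack a list of tokenized examples into one sequence without cross-contamination."""
    input_ids = torch.full((max_len,), pad_token_id)
    position_ids = torch.zeros(max_len, dtype=torch.long)
    labels = torch.full((max_len,), -100)          # -100 is ignored by the loss
    attn_mask = torch.zeros(max_len, max_len, dtype=torch.bool)

    offset = 0
    for ex in examples:
        ex = torch.tensor(ex)
        n = ex.numel()
        if offset + n > max_len:
            break
        sl = slice(offset, offset + n)
        input_ids[sl] = ex
        position_ids[sl] = torch.arange(n)         # positions restart for each example
        labels[sl] = ex
        # Assuming the usual shifted-label convention, masking the first token of each
        # example keeps the previous example's EOS from being trained to predict it.
        labels[offset] = -100
        # Causal attention restricted to this example's block (block-diagonal overall).
        attn_mask[sl, sl] = torch.ones(n, n).tril().bool()
        offset += n
    return input_ids, attn_mask, position_ids, labels

input_ids, attn_mask, position_ids, labels = pack_examples(
    [[5, 6, 7, 2], [8, 9, 2], [3, 4, 5, 6, 2]]
)
print(attn_mask.int())  # lower-triangular blocks on the diagonal, no cross-example attention
```

With inputs shaped like this, tokens in one packed example could not attend to tokens in another, and the loss would skip the cross-example boundary.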

Motivation

I find it unsettling that when packing, we simply let later subsequences' tokens attend to earlier subsequences' tokens. Packed sequences could have nothing to do with one another, and I don't want to contaminate examples. At the same time, I don't want to give up the throughput gains of packing sequences.

I suppose I could sort my dataset by length to minimize the wasted computation (i.e. pack approximately equal-length examples into batches together) as a decent workaround (see the sketch below). I'm not sure whether this will impact model performance in any way, though.
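For what it's worth, a minimal sketch of that workaround might look like the following; the `length_grouped_batches` helper is made up for illustration.

```python
# Rough sketch of the length-bucketing workaround described above (hypothetical helper).
def length_grouped_batches(tokenized, batch_size):
    # Sort example indices by length, then slice off batches of neighbors,
    # so each batch needs little padding.
    order = sorted(range(len(tokenized)), key=lambda i: len(tokenized[i]))
    return [
        [tokenized[i] for i in order[start:start + batch_size]]
        for start in range(0, len(order), batch_size)
    ]

batches = length_grouped_batches([[1, 2, 3], [4], [5, 6], [7, 8, 9, 10]], batch_size=2)
# -> [[[4], [5, 6]], [[1, 2, 3], [7, 8, 9, 10]]]
```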

This feature request has been raised several times: huggingface/trl#302
#17726
I think TensorFlow implements this, and GraphCore talks about it here and in their paper.

Your contribution

This doesn't strike me as a "first contribution," but if someone wants to coach me, I can give it a shot.

@ydshieh
Collaborator

ydshieh commented Aug 23, 2023

Also in #6661

@ydshieh
Collaborator

ydshieh commented Aug 23, 2023

Hi @ToddMorrill

For existing models, I'm afraid it's unlikely that we will make changes for this feature. If a new model supports this natively in its original modeling code, the feature could come along when that model is ported into transformers.

@ToddMorrill
Author

It’s a real bummer because it seems like an important feature to have.
