Packing without cross-contamination #25452

Closed
ToddMorrill opened this issue Aug 11, 2023 · 3 comments

Comments

@ToddMorrill

Feature request

Is there something within Hugging Face that prevents later subsequences from attending to earlier subsequences when you use packing? Is there a way to implement attention masking so that each subsequence only attends to tokens within its own subsequence in a packed example?

As it currently stands:

  1. the attention mask fed into transformers is a 1D sequence, and we would need to be able to pass a 2D mask to specify the appropriate attention pattern when multiple sequences are packed together
  2. this will interact with positional embeddings, because positions should be relative to the start of each example, not to the sequence it's packed into
  3. this will impact the loss calculation at the boundaries of examples. In particular, EOS tokens shouldn't incur a loss for predicting the start of the next example.
  4. and there may be other impacts I'm not thinking of.

There appear to be a few challenges to overcome, but it nevertheless seems like an important feature to have. A rough sketch of what the packed inputs could look like is below.
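To make the pieces above concrete, here is a hypothetical sketch (current transformers models only accept a 1D attention mask, so this is not something you can pass to a model today): a block-diagonal causal attention mask, `position_ids` that restart at each example, and labels masked with `-100` at example boundaries. The `pack_examples` helper and its exact conventions are made up for illustration.

```python
# Hypothetical sketch only: shows what packed inputs without cross-contamination
# could look like if 2D attention masks were supported.
import torch

def pack_examples(examples, pad_token_id=0, max_len=16):
    """Pack a list of tokenized examples into one sequence without cross-contamination."""
    input_ids = torch.full((max_len,), pad_token_id)
    position_ids = torch.zeros(max_len, dtype=torch.long)
    labels = torch.full((max_len,), -100)          # -100 is ignored by the loss
    attn_mask = torch.zeros(max_len, max_len, dtype=torch.bool)

    offset = 0
    for ex in examples:
        ex = torch.tensor(ex)
        n = ex.numel()
        if offset + n > max_len:
            break
        sl = slice(offset, offset + n)
        input_ids[sl] = ex
        position_ids[sl] = torch.arange(n)         # positions restart for each example
        labels[sl] = ex
        # Assuming the usual shifted-label convention, masking the first token of each
        # example keeps the previous example's EOS from being trained to predict it.
        labels[offset] = -100
        # Causal attention restricted to this example's block (block-diagonal overall).
        attn_mask[sl, sl] = torch.ones(n, n).tril().bool()
        offset += n
    return input_ids, attn_mask, position_ids, labels

input_ids, attn_mask, position_ids, labels = pack_examples(
    [[5, 6, 7, 2], [8, 9, 2], [3, 4, 5, 6, 2]]
)
print(attn_mask.int())  # lower-triangular blocks on the diagonal, no cross-example attention
```

With inputs shaped like this, tokens in one packed example could not attend to tokens in another, and the loss would skip the cross-example boundary.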

Motivation

I find it unsettling that when packing, we simply let later subsequences' tokens attend to earlier subsequences' tokens. Packed sequences could have nothing to do with one another, and I don't want to contaminate examples. At the same time, I don't want to give up the throughput gains of packing sequences.

I suppose I could sort my dataset by length to minimize the wasted computation (i.e. pack approximately equal-length examples into batches together) as a decent workaround (see the sketch below). I'm not sure whether this will impact model performance in any way, though.
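For what it's worth, a minimal sketch of that workaround might look like the following; the `length_grouped_batches` helper is made up for illustration.

```python
# Rough sketch of the length-bucketing workaround described above (hypothetical helper).
def length_grouped_batches(tokenized, batch_size):
    # Sort example indices by length, then slice off batches of neighbors,
    # so each batch needs little padding.
    order = sorted(range(len(tokenized)), key=lambda i: len(tokenized[i]))
    return [
        [tokenized[i] for i in order[start:start + batch_size]]
        for start in range(0, len(order), batch_size)
    ]

batches = length_grouped_batches([[1, 2, 3], [4], [5, 6], [7, 8, 9, 10]], batch_size=2)
# -> [[[4], [5, 6]], [[1, 2, 3], [7, 8, 9, 10]]]
```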

This feature request has been raised several times: huggingface/trl#302
#17726
I think TensorFlow implements this, and GraphCore talks about it here and in their paper.

Your contribution

This doesn't strike me as a "first contribution," but if someone wants to coach me, I can give it a shot.

@ydshieh
Collaborator

ydshieh commented Aug 23, 2023

Also in #6661

@ydshieh
Collaborator

ydshieh commented Aug 23, 2023

Hi @ToddMorrill

For existing models, I'm afraid it's unlikely that we will make changes for this feature. If a new model supports this natively in its original modeling code, the feature could come along when that model is ported into transformers.

@ToddMorrill
Author

It’s a real bummer because it seems like an important feature to have.
