
Fix invalid mask implementation #359

Open · wants to merge 1 commit into main
Conversation

KyuHwan00

Fix the masking implementation in TextEncoderT5.

Issue

  • The current implementation's mask is inverted relative to the Tokenizer's attention mask, which uses 1 for attended positions and 0 for masked positions. That is, the current implementation sets attended tokens to 0 and masked tokens to 1, which prevents proper masking.

Solution

  • Fix the attentionMask implementation.
    token == padToken ? maskToken : padToken => token == padToken ? padToken : maskToken
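To make the difference concrete, here is a minimal sketch of the two mask constructions, transliterated from the Swift ternaries into Python. The concrete token values (pad_token = 0, mask_token = 1) and the example input ids are assumptions for illustration only; in the actual encoder these are values in a CoreML input array.

```python
pad_token, mask_token = 0, 1  # assumed values for illustration

def current_mask(token_ids):
    # Current implementation: token == padToken ? maskToken : padToken
    # Attended positions get 0, padding gets 1 -- inverted relative to
    # the tokenizer convention (1 = attend, 0 = mask).
    return [mask_token if t == pad_token else pad_token for t in token_ids]

def fixed_mask(token_ids):
    # Proposed fix: token == padToken ? padToken : maskToken
    # Padding gets 0, attended positions get 1, matching the tokenizer.
    return [pad_token if t == pad_token else mask_token for t in token_ids]

ids = [101, 202, 303, 0, 0]   # hypothetical input ids with trailing padding
print(current_mask(ids))      # [0, 0, 0, 1, 1]
print(fixed_mask(ids))        # [1, 1, 1, 0, 0]
```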

Collaborator

@alejandro-isaza left a comment

Thanks. I know we currently don't have a test for this encoder, but it would be good to at least test this aspect of it.

@KyuHwan00
Author

Thank you for the feedback. I have added test cases covering the encoder functionality with the modified code, focusing specifically on the aspect you mentioned.

Before discussing the test results, let me mention an additional fix:
attentionMask[0] = bosToken => attentionMask[0] = eosToken
The reasoning is the same as for the previous modification.

Test Env

  • Used baseline T5 model: google/t5-v1_1-base
  • Test device: iPhone 15 Pro
  • iOS version: 17.6

Result

Here are the results of my tests (each result only extracts the first 100 elements):

  • The test device's output with the current version
    (token == padToken ? maskToken : padToken & attentionMask[0] = bosToken):
[0.4260242, -0.13146661, -0.05443061, 0.4278283, -0.08791171, 0.07497577, 0.27828237, -0.08772106, 0.025646126,
0.13052055, -0.11672092, 0.008639214, -0.25987187, -0.3600051, -0.19493277, -0.041954715, -0.0004980617,
0.27182093, 0.43326083, -0.0035369801, 0.15730482, 0.20143896, 0.110297814, -0.16129531, -0.11004589,
-0.028402163, -0.12952378, -0.25261134, 0.28395727, -0.26019362, 0.13527231, -0.33412644, 0.29622442,
-0.4236819, -0.12266792, 0.036390632, 0.13896675, -0.017689465, -0.016119154, -0.20535961, 0.016372599,
0.3201423, 0.5513491, 0.12941936, 0.0044890335, -0.020545868, 1.0494097, -0.05484867, -0.08088168, 0.04241238,
0.032832634, 0.029508296, -0.13568477, -0.06957459, 0.13856326, 0.04223821, -0.19883841, 0.17156312,
0.20071374, 0.14013502, 0.01915358, -0.015412725, -0.3586532, 0.19870706, -0.032982983, 0.011797173,
-0.029784583, -0.14813012, 0.15077302, 0.34914255, 0.09121524, 0.11863368, 0.1667563, 0.28975943, 0.03273004,
0.31420997, 0.0073143677, -0.09999864, -0.07020134, 0.051774386, -0.20251463, -0.2000018, -0.12373174,
-0.13639234, 0.14170212, -0.1797655, -0.27845755, -0.59155124, -0.09609143, 0.26174963, -0.306249, 0.04012702,
0.043527234, 0.12504794, 0.18454747, 0.47887146, -0.0998203, 0.41656193, 0.056153394, 0.16919865,
-0.034775756]
  • The test device's output with the fixed version
    (token == padToken ? padToken : maskToken & attentionMask[0] = eosToken):
[-0.07631074, 0.020202171, 0.107408315, 0.16560224, 0.01083674, -0.26095173, 0.06565774, 0.12822415,
-0.15619999, -0.1736349, 0.17435148, 0.23238882, 0.0015687324, 0.07438593, 0.06850898, -0.05751907,
0.001194903, 0.21067473, 0.37535632, 0.31674048, 0.4225261, -0.15225923, 0.17576945, 0.01786299, -0.2679231,
0.19424905, -0.09404993, 0.0605059, 0.11416181, 0.11132382, 0.11258528, 0.0664307, 0.13322611, -0.25261748,
0.081475206, 0.17849542, 0.17070141, 0.05419595, -0.267129, -0.05937576, 0.1273334, -0.3682904, 0.012873589,
-0.057012416, 0.12541112, -0.01752916, 0.12443144, 0.14152816, 0.053368524, 0.13122968, 0.13614295, -0.16103818,
0.15859632, 0.012001296, -0.13215841, -0.013239053, 0.19885042, 0.2342411, 0.11136218, 0.14910215, -0.14328463,
-0.10696429, -0.14799124, 0.18553326, -0.05146354, -0.3661642, -0.26175126, 0.0993159, -0.04179929, 0.20667534,
0.16085686, 0.04715261, 0.2693186, 0.107595906, -0.0286839, -0.2055541, 0.30528533, -0.08761179, 0.15906587,
-0.08903981, 0.0024355464, 0.017705016, 0.32590625, -0.19161393, -0.15557778, -0.10807836, 0.16928717,
-0.15476838, 0.1307424, -0.42686257, -0.21989107, -0.05892702, 0.14845355, 0.37275925, -0.0014533018,
-0.037888847, 0.16911794, 0.09435404, 0.19486858, -0.18566807, -0.1713055]
  • The baseline's output:
[-0.0763,  0.0202,  0.1074,  0.1656,  0.0108, -0.2610,  0.0657,  0.1282,
-0.1562, -0.1736,  0.1744,  0.2324,  0.0016,  0.0744,  0.0685, -0.0575,
0.0012,  0.2107,  0.3754,  0.3167,  0.4225, -0.1523,  0.1758,  0.0179,
-0.2679,  0.1942, -0.0941,  0.0605,  0.1142,  0.1113,  0.1126,  0.0664,
0.1332, -0.2526,  0.0815,  0.1785,  0.1707,  0.0542, -0.2671, -0.0594,
0.1273, -0.3683,  0.0129, -0.0570,  0.1254, -0.0175,  0.1244,  0.1415,
0.0534,  0.1312,  0.1361, -0.1610,  0.1586,  0.0120, -0.1322, -0.0132,
0.1989,  0.2342,  0.1114,  0.1491, -0.1433, -0.1070, -0.1480,  0.1855,
-0.0515, -0.3662, -0.2618,  0.0993, -0.0418,  0.2067,  0.1609,  0.0472,
0.2693,  0.1076, -0.0287, -0.2056,  0.3053, -0.0876,  0.1591, -0.0890,
0.0024,  0.0177,  0.3259, -0.1916, -0.1556, -0.1081,  0.1693, -0.1548,
0.1307, -0.4269, -0.2199, -0.0589,  0.1485,  0.3728, -0.0015, -0.0379,
0.1691,  0.0944,  0.1949, -0.1857, -0.1713]

The results above show that the modified version's output matches the baseline.
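A comparison like the one above can be automated with a simple element-wise tolerance check; a minimal sketch, where the tolerance value and the truncated three-element samples are assumptions for illustration:

```python
def outputs_match(a, b, atol=1e-3):
    # Element-wise comparison with an absolute tolerance, mirroring the
    # manual check of the device output against the baseline above.
    return len(a) == len(b) and all(abs(x - y) <= atol for x, y in zip(a, b))

device = [-0.07631074, 0.020202171, 0.107408315]  # first elements, fixed version
baseline = [-0.0763, 0.0202, 0.1074]              # first elements, baseline
print(outputs_match(device, baseline))  # True
```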

Reason

As I mentioned, the T5 baseline model takes the tokenizer's outputs as inputs. So I believe this issue is caused by the difference between how the attention mask is implemented in the Tokenizer and in the current version.

Summary

The tests run successfully and validate the expected behavior. However, since mistakes in my model conversion or testing could affect the results, I am sharing the GitHub repository containing all the code I used for testing. Please let me know if any further improvements are needed.

@ZachNagengast
Contributor

ZachNagengast commented Sep 23, 2024

@KyuHwan00 A good way to test this would be to provide T5 embeddings to the model using this logic vs. the old logic and compare the resulting images; I would expect them to be significantly different. Could you provide these two images as an "end to end" test, as well as steps to reproduce it, perhaps via the CLI?

Here is a good example of a PR with this kind of end to end test: #357

Also worth mentioning that the current implementation is pinned to a specific version of T5 that was exported to optimize its computation for Apple Silicon, located here: https://huggingface.co/argmaxinc/coreml-stable-diffusion-3-medium/tree/main/TextEncoderT5.mlmodelc. These optimizations could be a source of the differences you are noticing. You may want to compare the outputs of that model and of the asset you exported here (https://github.com/KyuHwan00/TestT5Encoder/blob/main/t5export.py) against the baseline; all three should match for backwards compatibility.

@KyuHwan00
Author

@ZachNagengast Thanks for your feedback. I'll get back with the appropriate test cases ASAP.

@KyuHwan00
Author

@ZachNagengast @alejandro-isaza Thanks for your patience. Here are the results for the appropriate test cases.

  • Prompt: A beautiful painting of flowing colors and styles forming the words "The SD3 research paper is here!", the background is speckled with drops and splashes of paint

  • Result of old version
    old_version

  • Result of my modified version
    modified_version

As the generated images above show, the modified version does not produce appropriate output for the prompt. This difference is likely because the model is a specific version of T5 exported to optimize its computation for Apple Silicon, as @ZachNagengast mentioned. Sorry for missing this, and thanks for your feedback.

@ZachNagengast
Contributor

Thanks for running these. Since the two models are different, perhaps you can adjust your converter code to conform to the pre-exported model. I think the logic makes sense as-is, but let me know if you think otherwise:

token == padToken ? maskToken : padToken

  1. When building the attention mask, if we see a pad token in the input ids, we should mask it out, because it doesn't contain meaningful information about the text describing the image.
  2. If it is not a pad token, i.e. it is meaningful text, we should not mask it, hence the pad token value.

attentionMask[0] = bosToken

  1. Since the input ids will have bos tokens, which have the same value as the pad token, we overwrite the first entry in the mask so that the bos token on the input is attended to; otherwise it would be masked out.
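The logic described above can be sketched as a small Python transliteration of the Swift ternary plus the first-entry overwrite. The concrete values (pad_token = 0, bos_token = 0, mask_token = 1) are assumptions for illustration; the pre-exported CoreML model may use different numeric mask values.

```python
pad_token = 0    # assumed pad id
bos_token = 0    # per the description above, shares its value with pad
mask_token = 1   # assumed "masked out" value

def build_attention_mask(token_ids):
    # 1. Pad tokens carry no information about the prompt, so mask them out.
    # 2. Non-pad tokens are meaningful text, so leave them unmasked
    #    (the pad token value marks an attended position in this scheme).
    mask = [mask_token if t == pad_token else pad_token for t in token_ids]
    # 3. The first input id is the bos token, which shares the pad value;
    #    overwrite the first mask entry so the bos token is attended to.
    mask[0] = bos_token
    return mask

print(build_attention_mask([0, 5, 6, 0, 0]))  # [0, 0, 0, 1, 1]
```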

3 participants