
Fix invalid mask implementation #359

Open · wants to merge 1 commit into main
Conversation

KyuHwan00

Fix the masking implementation in TextEncoderT5.

Issue

  • The current implementation's mask is inverted relative to the Tokenizer's attention mask, which uses 1 for attended positions and 0 for masked positions. That is, the current implementation sets attended tokens to 0 and masked tokens to 1, which prevents proper masking.

Solution

  • Fix the attentionMask implementation.
    token == padToken ? maskToken : padToken => token == padToken ? padToken : maskToken
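To make the difference concrete, here is a minimal sketch of the two mask constructions, transliterated from the Swift ternaries into Python. The concrete token values (pad_token = 0, mask_token = 1) and the example input ids are assumptions for illustration only; in the actual encoder these are values in a CoreML input array.

```python
pad_token, mask_token = 0, 1  # assumed values for illustration

def current_mask(token_ids):
    # Current implementation: token == padToken ? maskToken : padToken
    # Attended positions get 0, padding gets 1 -- inverted relative to
    # the tokenizer convention (1 = attend, 0 = mask).
    return [mask_token if t == pad_token else pad_token for t in token_ids]

def fixed_mask(token_ids):
    # Proposed fix: token == padToken ? padToken : maskToken
    # Padding gets 0, attended positions get 1, matching the tokenizer.
    return [pad_token if t == pad_token else mask_token for t in token_ids]

ids = [101, 202, 303, 0, 0]   # hypothetical input ids with trailing padding
print(current_mask(ids))      # [0, 0, 0, 1, 1]
print(fixed_mask(ids))        # [1, 1, 1, 0, 0]
```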

Collaborator

@alejandro-isaza left a comment

Thanks. I know we currently don't have a test for this encoder, but it would be good to at least test this aspect of it.

@KyuHwan00
Author

Thank you for the feedback. I have added test cases covering the encoder functionality with the modified code, focusing specifically on the aspect you mentioned.

Before discussing the test results, let me mention an additional fix:
attentionMask[0] = bosToken => attentionMask[0] = eosToken
The reasoning is the same as for the previous modification.

Test Env

  • Used baseline T5 model: google/t5-v1_1-base
  • Test device: iPhone 15 Pro
  • iOS version: 17.6

Result

Here are the results of my tests (each result only extracts the first 100 elements):

  • The test device's output with the current version
    (token == padToken ? maskToken : padToken & attentionMask[0] = bosToken):
[0.4260242, -0.13146661, -0.05443061, 0.4278283, -0.08791171, 0.07497577, 0.27828237, -0.08772106, 0.025646126,
0.13052055, -0.11672092, 0.008639214, -0.25987187, -0.3600051, -0.19493277, -0.041954715, -0.0004980617,
0.27182093, 0.43326083, -0.0035369801, 0.15730482, 0.20143896, 0.110297814, -0.16129531, -0.11004589,
-0.028402163, -0.12952378, -0.25261134, 0.28395727, -0.26019362, 0.13527231, -0.33412644, 0.29622442,
-0.4236819, -0.12266792, 0.036390632, 0.13896675, -0.017689465, -0.016119154, -0.20535961, 0.016372599,
0.3201423, 0.5513491, 0.12941936, 0.0044890335, -0.020545868, 1.0494097, -0.05484867, -0.08088168, 0.04241238,
0.032832634, 0.029508296, -0.13568477, -0.06957459, 0.13856326, 0.04223821, -0.19883841, 0.17156312,
0.20071374, 0.14013502, 0.01915358, -0.015412725, -0.3586532, 0.19870706, -0.032982983, 0.011797173,
-0.029784583, -0.14813012, 0.15077302, 0.34914255, 0.09121524, 0.11863368, 0.1667563, 0.28975943, 0.03273004,
0.31420997, 0.0073143677, -0.09999864, -0.07020134, 0.051774386, -0.20251463, -0.2000018, -0.12373174,
-0.13639234, 0.14170212, -0.1797655, -0.27845755, -0.59155124, -0.09609143, 0.26174963, -0.306249, 0.04012702,
0.043527234, 0.12504794, 0.18454747, 0.47887146, -0.0998203, 0.41656193, 0.056153394, 0.16919865,
-0.034775756]
  • The test device's output with the fixed version
    (token == padToken ? padToken : maskToken & attentionMask[0] = eosToken):
[-0.07631074, 0.020202171, 0.107408315, 0.16560224, 0.01083674, -0.26095173, 0.06565774, 0.12822415,
-0.15619999, -0.1736349, 0.17435148, 0.23238882, 0.0015687324, 0.07438593, 0.06850898, -0.05751907,
0.001194903, 0.21067473, 0.37535632, 0.31674048, 0.4225261, -0.15225923, 0.17576945, 0.01786299, -0.2679231,
0.19424905, -0.09404993, 0.0605059, 0.11416181, 0.11132382, 0.11258528, 0.0664307, 0.13322611, -0.25261748,
0.081475206, 0.17849542, 0.17070141, 0.05419595, -0.267129, -0.05937576, 0.1273334, -0.3682904, 0.012873589,
-0.057012416, 0.12541112, -0.01752916, 0.12443144, 0.14152816, 0.053368524, 0.13122968, 0.13614295, -0.16103818,
0.15859632, 0.012001296, -0.13215841, -0.013239053, 0.19885042, 0.2342411, 0.11136218, 0.14910215, -0.14328463,
-0.10696429, -0.14799124, 0.18553326, -0.05146354, -0.3661642, -0.26175126, 0.0993159, -0.04179929, 0.20667534,
0.16085686, 0.04715261, 0.2693186, 0.107595906, -0.0286839, -0.2055541, 0.30528533, -0.08761179, 0.15906587,
-0.08903981, 0.0024355464, 0.017705016, 0.32590625, -0.19161393, -0.15557778, -0.10807836, 0.16928717,
-0.15476838, 0.1307424, -0.42686257, -0.21989107, -0.05892702, 0.14845355, 0.37275925, -0.0014533018,
-0.037888847, 0.16911794, 0.09435404, 0.19486858, -0.18566807, -0.1713055]
  • The baseline's output:
[-0.0763,  0.0202,  0.1074,  0.1656,  0.0108, -0.2610,  0.0657,  0.1282,
-0.1562, -0.1736,  0.1744,  0.2324,  0.0016,  0.0744,  0.0685, -0.0575,
0.0012,  0.2107,  0.3754,  0.3167,  0.4225, -0.1523,  0.1758,  0.0179,
-0.2679,  0.1942, -0.0941,  0.0605,  0.1142,  0.1113,  0.1126,  0.0664,
0.1332, -0.2526,  0.0815,  0.1785,  0.1707,  0.0542, -0.2671, -0.0594,
0.1273, -0.3683,  0.0129, -0.0570,  0.1254, -0.0175,  0.1244,  0.1415,
0.0534,  0.1312,  0.1361, -0.1610,  0.1586,  0.0120, -0.1322, -0.0132,
0.1989,  0.2342,  0.1114,  0.1491, -0.1433, -0.1070, -0.1480,  0.1855,
-0.0515, -0.3662, -0.2618,  0.0993, -0.0418,  0.2067,  0.1609,  0.0472,
0.2693,  0.1076, -0.0287, -0.2056,  0.3053, -0.0876,  0.1591, -0.0890,
0.0024,  0.0177,  0.3259, -0.1916, -0.1556, -0.1081,  0.1693, -0.1548,
0.1307, -0.4269, -0.2199, -0.0589,  0.1485,  0.3728, -0.0015, -0.0379,
0.1691,  0.0944,  0.1949, -0.1857, -0.1713]

The results above show that the modified version's output matches the baseline.
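A comparison like the one above can be automated with a simple element-wise tolerance check; a minimal sketch, where the tolerance value and the truncated three-element samples are assumptions for illustration:

```python
def outputs_match(a, b, atol=1e-3):
    # Element-wise comparison with an absolute tolerance, mirroring the
    # manual check of the device output against the baseline above.
    return len(a) == len(b) and all(abs(x - y) <= atol for x, y in zip(a, b))

device = [-0.07631074, 0.020202171, 0.107408315]  # first elements, fixed version
baseline = [-0.0763, 0.0202, 0.1074]              # first elements, baseline
print(outputs_match(device, baseline))  # True
```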

Reason

As I mentioned, the T5 baseline model takes the tokenizer's outputs as inputs. So I believe this issue is caused by the difference between how the attention mask is implemented in the Tokenizer and in the current version.

Summary

The tests run successfully and validate the expected behavior. However, since mistakes in my model conversion or testing could affect the results, I am sharing the GitHub repository containing all the code I used for testing. Please let me know if any further improvements are needed.

@ZachNagengast
Contributor

ZachNagengast commented Sep 23, 2024

@KyuHwan00 A good way to test this would be to provide T5 embeddings to the model using this logic vs. the old logic and compare the resulting images; I would expect them to be significantly different. Could you provide these two images as an "end to end" test, as well as steps to reproduce it, perhaps via the CLI?

Here is a good example of a PR with this kind of end to end test: #357

Also worth mentioning that the current implementation is pinned to a specific version of T5 that was exported to optimize its computation for Apple Silicon, located here: https://huggingface.co/argmaxinc/coreml-stable-diffusion-3-medium/tree/main/TextEncoderT5.mlmodelc. These optimizations could be a source of the differences you are noticing. You may want to compare the outputs of that model and of the asset you exported here (https://github.com/KyuHwan00/TestT5Encoder/blob/main/t5export.py) against the baseline; all three should match for backwards compatibility.

@KyuHwan00
Author

@ZachNagengast Thanks for your feedback. I'll get back with the appropriate test cases ASAP.

@KyuHwan00
Author

@ZachNagengast @alejandro-isaza Thanks for your patience. Here are the results for the appropriate test cases.

  • Prompt: A beautiful painting of flowing colors and styles forming the words "The SD3 research paper is here!", the background is speckled with drops and splashes of paint

  • Result of old version
    old_version

  • Result of my modified version
    modified_version

As the generated images above show, the modified version does not produce appropriate output for the prompt. This difference is likely because the model is a specific version of T5 exported to optimize its computation for Apple Silicon, as @ZachNagengast mentioned. Sorry for missing this, and thanks for your feedback.

@ZachNagengast
Contributor

Thanks for running these. Since the two models are different, perhaps you can adjust your converter code to conform to the pre-exported model. I think the logic makes sense as-is, but let me know if you think otherwise:

token == padToken ? maskToken : padToken

  1. When building the attention mask, if we see a pad token in the input ids, we should mask it out, because it doesn't contain meaningful information about the text describing the image.
  2. If it is not a pad token, i.e. it is meaningful text, we should not mask it, hence the pad token value.

attentionMask[0] = bosToken

  1. Since the input ids will have bos tokens, which have the same value as the pad token, we overwrite the first entry in the mask so that the bos token on the input is attended to; otherwise it would be masked out.
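The logic described above can be sketched as a small Python transliteration of the Swift ternary plus the first-entry overwrite. The concrete values (pad_token = 0, bos_token = 0, mask_token = 1) are assumptions for illustration; the pre-exported CoreML model may use different numeric mask values.

```python
pad_token = 0    # assumed pad id
bos_token = 0    # per the description above, shares its value with pad
mask_token = 1   # assumed "masked out" value

def build_attention_mask(token_ids):
    # 1. Pad tokens carry no information about the prompt, so mask them out.
    # 2. Non-pad tokens are meaningful text, so leave them unmasked
    #    (the pad token value marks an attended position in this scheme).
    mask = [mask_token if t == pad_token else pad_token for t in token_ids]
    # 3. The first input id is the bos token, which shares the pad value;
    #    overwrite the first mask entry so the bos token is attended to.
    mask[0] = bos_token
    return mask

print(build_attention_mask([0, 5, 6, 0, 0]))  # [0, 0, 0, 1, 1]
```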

3 participants