
[FEA] Data loader: support to padding sparse sequential features on the left side #128

Open
gabrielspmoreira opened this issue Aug 30, 2021 · 11 comments


@gabrielspmoreira
Member

gabrielspmoreira commented Aug 30, 2021

Is your feature request related to a problem? Please describe.
The PyTorch and TF dataloaders support padding list (sparse) features on the right, which means that shorter list sequences are completed with 0s on the right.
For sequential recommendation, a common use case is to keep the last N user interactions, which can be done either in preprocessing or on the model side. The NVT Slice op supports truncating to the last N elements (by providing negative limits).
But it is also useful to be able to do additional truncation on the model side (e.g. truncating with a larger max sequence threshold in the Slice op and then tuning the best max sequence length according to model accuracy and training speed). For such truncation on the model side, the padding needs to be applied by the Data Loader on the left side of the sequence features, so that when they are converted to dense tensors the padding 0s are placed on the left. That way, features can be sliced in the model like feature[:, -keep_last_n:] without losing the sequence features of users with fewer than N interactions.
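To make this concrete, here is a minimal sketch (hypothetical tensors, not the dataloader API) of why the model-side slice only works safely with left padding:

import torch

keep_last_n = 3

# Right-padded: the slice keeps padding 0s and drops real values for short rows
right_padded = torch.tensor([[10, 20, 0, 0, 0],
                             [30,  0, 0, 0, 0]])
print(right_padded[:, -keep_last_n:])  # tensor([[0, 0, 0], [0, 0, 0]])

# Left-padded: the slice keeps the most recent interactions intact
left_padded = torch.tensor([[0, 0, 0, 10, 20],
                            [0, 0, 0,  0, 30]])
print(left_padded[:, -keep_last_n:])   # tensor([[ 0, 10, 20], [ 0,  0, 30]])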

Describe the solution you'd like
Create an argument for the dataloader, sparse_padding_side, which by default is 'right' but can be set to 'left'.

@gabrielspmoreira gabrielspmoreira changed the title [FEA] Data loader [FEA] Data loader: support to padding sparse sequential features on the left side Aug 30, 2021
@lesnikow

I am assigned this, but per CONTRIBUTING.md I also wanted to comment and say that I have begun working on this.

lesnikow referenced this issue in NVIDIA-Merlin/NVTabular Sep 14, 2021
Update docstrings for issue #1077. This touches the tensorflow
and torch dataloader modules and the list_slice op module. The
motivation for this is to improve readability. This commit is
towards resolving issue #1077 on implementing left padding
for sparse sequential features.
@lesnikow

lesnikow commented Sep 14, 2021

Hello everyone, and @benfred @karlhigley @jperez999 in particular, since I see you have significant contributions to these test files:

Would you know, is there an existing unit test or tests that you can point me to that would serve as a template for this feature request? I have looked at test_torch_dataset.py, test_tf_dataset.py, test_dataloader_backend.py, and the list_slice test in test_ops.py, but I am not sure which would be a good unit test template for an implementation that adds left padding to the two framework dataloaders.

In particular, is there a unit test currently available that tests the existing right padding mentioned in the issue by @gabrielspmoreira?

I did find def test_nested_list(), authored by @zhenwendai in test_tf_dataloader.py, which has, for instance, pad_data_col = ..., but I am not sure whether it tests padding in addition to the loading of nested lists. Thank you.

@lesnikow

lesnikow commented Sep 15, 2021

@gabrielspmoreira, when you have a chance, would you be able to point me to where in the codebase the existing right padding can be specified? I have looked in the codebase and dataloader modules, but so far I have not been able to find this. I would like to add the sparse_padding_side argument at this point in the codebase, or someplace logically related.

@lesnikow

@benfred @jperez999 @karlhigley Checking in today: would you have any help, advice, or guidance on my comment above?

@lesnikow

@gabrielspmoreira Checking in with you today: would you have any guidance or advice on my comment above?

@gabrielspmoreira
Member Author

Hey @lesnikow. Some time ago I created a preliminary version of the code that converts the internal NVTabular representation of sparse features (values, offsets) to sparse tensors, and @jperez999 later ported it and integrated it into the NVTabular dataloader.

To give an example, let's say a column in a parquet file has sequence/list values, with 3 rows like this:

[10,20]
[30]
[40,50,60]

The internal NVTabular representation (values, offsets) would be something like the following, where the offsets indicate how many values we have for each row:

values = [10,20,30,40,50,60]
offsets = [2,1,3]

Then the equivalent sparse matrix can be built with values and indices like this:

values = [10,20,30,40,50,60]
indices = [[0,0],
           [0,1],
           [1,0],
           [2,0],
           [2,1],
           [2,2]]
Finally the sparse tensor is converted to a dense tensor in this line, which is padded on the right. In this example I assume seq_limit=5 for this feature:

[10, 20,  0,  0,  0]
[30,  0,  0,  0,  0]
[40, 50, 60,  0,  0]

To pad the items on the left, I believe we just need to subtract the 2nd column of the sparse matrix indices from seq_limit - 1, so that it becomes:

indices = [[0,4],
           [0,3],
           [1,4],
           [2,4],
           [2,3],
           [2,2]]

From the current implementation in NVTabular, I understand that the _get_indices() method is responsible for returning the indices for each value.
So maybe including this code after this line (if padding_direction==True) can do the trick ;)

indices[:,1] = seq_limit - 1 - indices[:,1]

If we currently don't have tests for the data loader methods that convert the offsets representation to sparse and dense features, it would be good to create such tests, using as a test case something similar to what I have described here.
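As a self-contained sketch of this whole conversion (hypothetical standalone code with assumed names like seq_limit and row_lengths, not the actual NVTabular internals):

import torch

values = torch.tensor([10, 20, 30, 40, 50, 60])
row_lengths = torch.tensor([2, 1, 3])  # the "offsets" above
seq_limit = 5

# Build (row, col) indices for the right-padded layout
rows = torch.repeat_interleave(torch.arange(len(row_lengths)), row_lengths)
cols = torch.cat([torch.arange(n) for n in row_lengths.tolist()])

# Mirror the column indices to pad on the left, as suggested above.
# Note this reverses each row's values ([20, 10] instead of [10, 20]).
left_cols = seq_limit - 1 - cols

sparse = torch.sparse_coo_tensor(
    torch.stack([rows, left_cols]), values, size=(3, seq_limit)
)
print(sparse.to_dense())
# tensor([[ 0,  0,  0, 20, 10],
#         [ 0,  0,  0,  0, 30],
#         [ 0,  0, 60, 50, 40]])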

@lesnikow

lesnikow commented Sep 16, 2021

@gabrielspmoreira This is all very good to know, thank you. @gabrielspmoreira @jperez999 do you have a commit hash that you could point me to so that I can see and review this code change that @gabrielspmoreira mentioned?

@lesnikow

lesnikow commented Sep 16, 2021

@gabrielspmoreira, when you have a chance, would you be able to point me to where in the codebase the existing right padding can be specified? I have looked in the codebase and dataloader modules, but so far I have not been able to find this. I would like to add the sparse_padding_side argument at this point in the codebase, or someplace logically related.

@gabrielspmoreira Would you have any insight or guidance on this? I can see where this could be implemented in the torch dataloader based on your line reference above, but it sounded like there is some user-facing method or API for the dataloaders that you are referencing that I am having trouble finding. In addition to modifying the torch implementation, that user-facing method signature would need to be updated, and I would need to make sure the specification is handled correctly by both the existing torch and tensorflow dataloaders.

lesnikow referenced this issue in NVIDIA-Merlin/NVTabular Sep 16, 2021
Implementation of left padding for issue #1077. This is based on a suggestion
by @gabrielspmoreira. I am not exactly sure if this change will completely
work, and this is untested due to current failing tests on main on this part of
the codebase. But the motivation of this commit is to start a commit for
comments, suggestions, and revisions on this issue's implementation.
lesnikow referenced this issue in NVIDIA-Merlin/NVTabular Sep 16, 2021
Update #1077 implementation with some useful feedback from running pre-commit
and linters. The motivation is to better pass the CI checks and code
consistency.
lesnikow referenced this issue in NVIDIA-Merlin/NVTabular Sep 16, 2021
Implement #1077 update with docstring and type hinting. Note that black adds
spaces in the method signature type hinting for the `padding` argument. We add
a docstring for _build_sparse_tensor(), as this is being modified in this
issue's implementation. The motivation for this is improved codebase
readability.
@gabrielspmoreira
Member Author

Hey Adam. The user-facing class is the DataLoader. For example, in PyTorch it is the TorchAsyncItr class and in TF it is KerasSequenceLoader.
We should create an argument sparse_padding_side, which should accept 'right' (default) or 'left'.
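For illustration, the proposed argument might look something like this on the torch side (a sketch only: sparse_padding_side is the proposal, not an existing parameter, the feature name is made up, and the other arguments are abbreviated):

from nvtabular.loader.torch import TorchAsyncItr

loader = TorchAsyncItr(
    dataset,                          # an nvtabular Dataset
    batch_size=1024,
    sparse_names=["item_id_seq"],     # list features to return as sparse/dense
    sparse_max={"item_id_seq": 20},   # max sequence length (seq_limit) per feature
    sparse_padding_side="left",       # proposed; default would be "right"
)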

lesnikow referenced this issue in NVIDIA-Merlin/NVTabular Sep 28, 2021
Update tests for issue #1077. We update the test name to something more
descriptive, and update the test docstring to something more
informative.
lesnikow referenced this issue in NVIDIA-Merlin/NVTabular Sep 28, 2021
Add tests for issue #1077 for the TensorFlow runtime dataloader. The
motivation for this update is increased test coverage.
lesnikow referenced this issue in NVIDIA-Merlin/NVTabular Sep 30, 2021
Update tensorflow dataloader implementation for speed optimization. This
implements a suggested revision by @jperez999 for issue #1077.
@lesnikow

lesnikow commented Sep 30, 2021

@gabrielspmoreira I am wondering, is this current torch implementation at dd9927e sufficiently fast for your intended use case? I have used the approach that you suggested for sparse tensors in torch, but I had to modify it with a helper function. Just using the method you outlined would give

[ 0,  0,  0, 20, 10]
[ 0,  0,  0,  0, 30]
[ 0,  0, 60, 50, 40]

instead of the desired

[ 0,  0,  0, 10, 20]
[ 0,  0,  0,  0, 30]
[ 0,  0, 40, 50, 60]

In particular, each row's values are in reversed order. I added the _build_sparse_tensor_helper_process_column() method to modify the indices into the correct order. This method uses a Python while block. The current tensorflow implementation at that referenced commit uses tf operations, resulting in more speed-optimized code.

Hence, is the current torch implementation, with its Python while block, fast enough for your desired use case?

If not, do you have any ideas or approaches on how to use built-in torch.sparse class methods or other methods to make this more optimized or vectorized?

I have been working on this, but I have not yet found a way. One of the current difficulties is that methods like torch.nn.functional.pad only operate on dense tensors, and only pad constant amounts, whereas we want to pad row-dependent amounts. An approach similar to the current tensorflow implementation would need something analogous to TensorFlow's RaggedTensor class and its methods. There is a proposal to add an analogous NestedTensor class to torch, but this is only implemented in a separate package, and is not part of torch proper.
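For reference, here is a standalone sketch of the row-dependent shift we are after (hypothetical names; it uses dense advanced indexing rather than torch.sparse methods or torch padding ops, so it may not fit the dataloader's sparse pipeline as-is):

import torch

values = torch.tensor([10, 20, 30, 40, 50, 60])
row_lengths = torch.tensor([2, 1, 3])
seq_limit = 5

rows = torch.repeat_interleave(torch.arange(len(row_lengths)), row_lengths)
cols = torch.cat([torch.arange(n) for n in row_lengths.tolist()])

# Shift each value right by its row's padding amount, preserving value order
shift = (seq_limit - row_lengths)[rows]
dense = torch.zeros(len(row_lengths), seq_limit, dtype=values.dtype)
dense[rows, cols + shift] = values
print(dense)
# tensor([[ 0,  0,  0, 10, 20],
#         [ 0,  0,  0,  0, 30],
#         [ 0,  0, 40, 50, 60]])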

@lesnikow

@gabrielspmoreira Could you also clarify whether your intended padding would produce

[0, 0, 0, 10, 20]
[0, 0, 0, 0, 30]
[0, 0, 40, 50, 60]

or

[0, 0, 10, 20, 0]
[0, 0, 30, 0, 0]
[0, 0, 40, 50, 60]

In other words, is this feature for sliding each row all the way to the right side of the matrix, or for left-padding a constant amount on the left? The latter is significantly easier to implement, from what I have seen so far, than the former.

@lesnikow lesnikow removed their assignment Nov 10, 2021
@karlhigley karlhigley transferred this issue from NVIDIA-Merlin/NVTabular Apr 5, 2023