
[FEA] Data loader: support to padding sparse sequential features on the left side #128

Open
gabrielspmoreira opened this issue Aug 30, 2021 · 11 comments


@gabrielspmoreira
Member

gabrielspmoreira commented Aug 30, 2021

Is your feature request related to a problem? Please describe.
The PyTorch and TF dataloaders support padding list (sparse) features on the right, which means that shorter list sequences are completed with 0s on the right.
For sequential recommendation, a common use case is to keep the last N user interactions, which can be done either in preprocessing or on the model side. The NVT Slice op supports truncating to the last N elements (by providing negative limits).
But it is also useful to be able to do additional truncation on the model side (e.g. truncating with a larger max sequence threshold in the Slice op and then tuning the best max sequence length according to model accuracy and training speed). For such truncation on the model side, the padding needs to be applied by the Data Loader on the left side of the sequence features, so that when they are converted to dense tensors the padding 0s are placed on the left. That way, features can be sliced in the model like feature[:, -keep_last_n:] without losing the sequence features of users with fewer than N interactions.
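To make this concrete, here is a minimal sketch (hypothetical tensors, not the dataloader API) of why the model-side slice only works safely with left padding:

import torch

keep_last_n = 3

# Right-padded: the slice keeps padding 0s and drops real values for short rows
right_padded = torch.tensor([[10, 20, 0, 0, 0],
                             [30,  0, 0, 0, 0]])
print(right_padded[:, -keep_last_n:])  # tensor([[0, 0, 0], [0, 0, 0]])

# Left-padded: the slice keeps the most recent interactions intact
left_padded = torch.tensor([[0, 0, 0, 10, 20],
                            [0, 0, 0,  0, 30]])
print(left_padded[:, -keep_last_n:])   # tensor([[ 0, 10, 20], [ 0,  0, 30]])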

Describe the solution you'd like
Create an argument for the dataloader, sparse_padding_side, which by default is 'right' but can be set to 'left'.

@gabrielspmoreira gabrielspmoreira changed the title [FEA] Data loader [FEA] Data loader: support to padding sparse sequential features on the left side Aug 30, 2021
@lesnikow

I am assigned this, but per CONTRIBUTING.md I also wanted to comment and say that I have begun working on this.

lesnikow referenced this issue in NVIDIA-Merlin/NVTabular Sep 14, 2021
Update docstrings for issue #1077. This touches the tensorflow
and torch dataloader modules and the list_slice op module. The
motivation for this is to improve readability. This commit is
towards resolving issue #1077 on implementing left padding
for sparse sequential features.
@lesnikow

lesnikow commented Sep 14, 2021

Hello everyone, and @benfred @karlhigley @jperez999 in particular, since I see you have significant contributions to these test files:

Would you know, is there an existing unit test or tests that you can point me to that would serve as a template for this feature request? I have looked at test_torch_dataset.py, test_tf_dataset.py, test_dataloader_backend.py, and the list_slice test in test_ops.py, but I am not sure which would be a good unit test template for an implementation that adds left padding to the two framework dataloaders.

In particular, is there a unit test currently available that tests the existing right padding mentioned in the issue by @gabrielspmoreira?

I did find def test_nested_list(), authored by @zhenwendai in test_tf_dataloader.py, which has, for instance, pad_data_col = ..., but I am not sure whether it tests padding in addition to the loading of nested lists. Thank you.

@lesnikow

lesnikow commented Sep 15, 2021

@gabrielspmoreira, when you have a chance, would you be able to point me to where in the codebase the existing right padding can be specified? I have looked in the codebase and dataloader modules, but so far I have not been able to find this. I would like to add the sparse_padding_side argument at this point in the codebase, or someplace logically related.

@lesnikow

@benfred @jperez999 @karlhigley Checking in today: would you have any help, advice, or guidance on my comment above?

@lesnikow

@gabrielspmoreira Checking in with you today: would you have any guidance or advice on my comment above?

@gabrielspmoreira
Member Author

Hey @lesnikow. Some time ago I created a preliminary version of the code that converts the internal NVTabular representation of sparse features (values, offsets) to sparse tensors, and @jperez999 later ported it and integrated it into the NVTabular dataloader.

To give an example, let's say a column in a parquet file has sequence/list values, with 3 rows like this:

[10,20]
[30]
[40,50,60]

The internal NVTabular representation (values, offsets) would be something like the following, where the offsets indicate how many values we have for each row:

values = [10,20,30,40,50,60]
offsets = [2,1,3]

Then the equivalent sparse matrix can be built with values and indices like this:

values = [10,20,30,40,50,60]
indices = [[0,0],
           [0,1],
           [1,0],
           [2,0],
           [2,1],
           [2,2]]
Finally the sparse tensor is converted to a dense tensor in this line, which is padded on the right. In this example I assume seq_limit=5 for this feature:

[10, 20,  0,  0,  0]
[30,  0,  0,  0,  0]
[40, 50, 60,  0,  0]

To pad the items on the left, I believe we just need to subtract the 2nd column of the sparse matrix indices from seq_limit - 1, so that it becomes:

indices = [[0,4],
           [0,3],
           [1,4],
           [2,4],
           [2,3],
           [2,2]]

From the current implementation in NVTabular, I understand that the _get_indices() method is responsible for returning the indices for each value.
So maybe including this code after this line (if padding_direction==True) can do the trick ;)

indices[:,1] = seq_limit - 1 - indices[:,1]

If we currently don't have tests for the data loader methods that convert the offsets representation to sparse and dense features, it would be good to create such tests, using as a test case something similar to what I have described here.
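As a self-contained sketch of this whole conversion (hypothetical standalone code with assumed names like seq_limit and row_lengths, not the actual NVTabular internals):

import torch

values = torch.tensor([10, 20, 30, 40, 50, 60])
row_lengths = torch.tensor([2, 1, 3])  # the "offsets" above
seq_limit = 5

# Build (row, col) indices for the right-padded layout
rows = torch.repeat_interleave(torch.arange(len(row_lengths)), row_lengths)
cols = torch.cat([torch.arange(n) for n in row_lengths.tolist()])

# Mirror the column indices to pad on the left, as suggested above.
# Note this reverses each row's values ([20, 10] instead of [10, 20]).
left_cols = seq_limit - 1 - cols

sparse = torch.sparse_coo_tensor(
    torch.stack([rows, left_cols]), values, size=(3, seq_limit)
)
print(sparse.to_dense())
# tensor([[ 0,  0,  0, 20, 10],
#         [ 0,  0,  0,  0, 30],
#         [ 0,  0, 60, 50, 40]])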

@lesnikow

lesnikow commented Sep 16, 2021

@gabrielspmoreira This is all very good to know, thank you. @gabrielspmoreira @jperez999 do you have a commit hash that you could point me to so that I can see and review this code change that @gabrielspmoreira mentioned?

@lesnikow

lesnikow commented Sep 16, 2021

@gabrielspmoreira, when you have a chance, would you be able to point me to where in the codebase the existing right padding can be specified? I have looked in the codebase and dataloader modules, but so far I have not been able to find this. I would like to add the sparse_padding_side argument at this point in the codebase, or someplace logically related.

@gabrielspmoreira Would you have any insight or guidance on this? I can see where this could be implemented in the torch dataloader based on your line reference above, but it sounded like there is some user-facing method or API for the dataloaders that you are referencing that I am having trouble finding. In addition to modifying the torch implementation, that user-facing method signature would need to be updated, and I would need to make sure the specification is handled correctly by both the existing torch and tensorflow dataloaders.

lesnikow referenced this issue in NVIDIA-Merlin/NVTabular Sep 16, 2021
Implementation of left padding for issue #1077. This is based on a suggestion
by @gabrielspmoreira. I am not exactly sure if this change will completely
work, and this is untested due to current failing tests on main on this part of
the codebase. But the motivation of this commit is to start a commit for
comments, suggestions, and revisions on this issue's implementation.
lesnikow referenced this issue in NVIDIA-Merlin/NVTabular Sep 16, 2021
Update #1077 implementation with some useful feedback from running pre-commit
and linters. The motivation is to better pass the CI checks and code
consistency.
lesnikow referenced this issue in NVIDIA-Merlin/NVTabular Sep 16, 2021
Implement #1077 update with docstring and type hinting. Note that black adds
spaces in the method signature type hinting for the `padding` argument. We add
a docstring for _build_sparse_tensor(), as this is being modified in this
issue's implementation. The motivation for this is improved codebase
readability.
@gabrielspmoreira
Member Author

Hey Adam. The user-facing class is the DataLoader. For example, in PyTorch it is the TorchAsyncItr class and in TF it is KerasSequenceLoader.
We should create an argument sparse_padding_side, which should accept 'right' (default) or 'left'.
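For illustration, the proposed argument might look something like this on the torch side (a sketch only: sparse_padding_side is the proposal, not an existing parameter, the feature name is made up, and the other arguments are abbreviated):

from nvtabular.loader.torch import TorchAsyncItr

loader = TorchAsyncItr(
    dataset,                          # an nvtabular Dataset
    batch_size=1024,
    sparse_names=["item_id_seq"],     # list features to return as sparse/dense
    sparse_max={"item_id_seq": 20},   # max sequence length (seq_limit) per feature
    sparse_padding_side="left",       # proposed; default would be "right"
)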

lesnikow referenced this issue in NVIDIA-Merlin/NVTabular Sep 28, 2021
Update tests for issue #1077. We update the test name to something more
descriptive, and update the test docstring to something more
informative.
lesnikow referenced this issue in NVIDIA-Merlin/NVTabular Sep 28, 2021
Add tests for issue #1077 for the TensorFlow runtime dataloader. The
motivation for this update is increased test coverage.
lesnikow referenced this issue in NVIDIA-Merlin/NVTabular Sep 30, 2021
Update tensorflow dataloader implementation for speed optimization. This
implements a suggested revision by @jperez999 for issue #1077.
@lesnikow

lesnikow commented Sep 30, 2021

@gabrielspmoreira I am wondering, is this current torch implementation at dd9927e sufficiently fast for your intended use case? I have used the approach that you suggested for sparse tensors in torch, but I had to modify it with a helper function. Just using the method you outlined would give

[ 0,  0,  0, 20, 10]
[ 0,  0,  0,  0, 30]
[ 0,  0, 60, 50, 40]

instead of the desired

[ 0,  0,  0, 10, 20]
[ 0,  0,  0,  0, 30]
[ 0,  0, 40, 50, 60]

In particular, each row's values are in reversed order. I added the _build_sparse_tensor_helper_process_column() method to modify the indices into the correct order. This method uses a Python while block. The current tensorflow implementation at that referenced commit uses tf operations, resulting in more speed-optimized code.

Hence, is the current torch implementation, with its Python while block, fast enough for your desired use case?

If not, do you have any ideas or approaches on how to use built-in torch.sparse class methods or other methods to make this more optimized or vectorized?

I have been working on this, but I have not yet found a way. One of the current difficulties is that methods like torch.nn.functional.pad only operate on dense tensors, and only pad constant amounts, whereas we want to pad row-dependent amounts. An approach similar to the current tensorflow implementation would need something analogous to TensorFlow's RaggedTensor class and its methods. There is a proposal to add an analogous NestedTensor class to torch, but this is only implemented in a separate package, and is not part of torch proper.
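For reference, here is a standalone sketch of the row-dependent shift we are after (hypothetical names; it uses dense advanced indexing rather than torch.sparse methods or torch padding ops, so it may not fit the dataloader's sparse pipeline as-is):

import torch

values = torch.tensor([10, 20, 30, 40, 50, 60])
row_lengths = torch.tensor([2, 1, 3])
seq_limit = 5

rows = torch.repeat_interleave(torch.arange(len(row_lengths)), row_lengths)
cols = torch.cat([torch.arange(n) for n in row_lengths.tolist()])

# Shift each value right by its row's padding amount, preserving value order
shift = (seq_limit - row_lengths)[rows]
dense = torch.zeros(len(row_lengths), seq_limit, dtype=values.dtype)
dense[rows, cols + shift] = values
print(dense)
# tensor([[ 0,  0,  0, 10, 20],
#         [ 0,  0,  0,  0, 30],
#         [ 0,  0, 40, 50, 60]])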

@lesnikow

@gabrielspmoreira Could you also clarify whether your intended padding would produce

[0, 0, 0, 10, 20]
[0, 0, 0, 0, 30]
[0, 0, 40, 50, 60]

or

[0, 0, 10, 20, 0]
[0, 0, 30, 0, 0]
[0, 0, 40, 50, 60]

In other words, is this feature for sliding each row all the way to the right side of the matrix, or for left-padding a constant amount on the left? The latter is significantly easier to implement, from what I have seen so far, than the former.

@lesnikow lesnikow removed their assignment Nov 10, 2021
@karlhigley karlhigley transferred this issue from NVIDIA-Merlin/NVTabular Apr 5, 2023