[FEA] Data loader: support padding sparse sequential features on the left side #128
Comments
I am assigned this, but I wanted to comment in addition, per
Update docstrings for issue #1077. This touches the tensorflow and torch dataloader modules and the list_slice op module, and is towards resolving issue #1077 on implementing left padding for sparse sequential features. The motivation is to improve readability.
Hello everyone, and @benfred @karlhigley @jperez999 in particular, since I see you have significant contributions to these test files: is there an existing unit test or tests that you could point me to that would be similar to a unit test to write for this feature request? I have looked in test_torch_dataset.py, test_tf_dataset.py, test_dataloader_backend.py, and the
In particular, is there a unit test currently available for testing padding on the right, as mentioned in the issue by @gabrielspmoreira? I did find
@gabrielspmoreira, when you have a chance, would you be able to point me to where in the codebase the existing right padding can be specified? I have looked in the codebase and the dataloader modules, but so far I have not been able to find this. I would like to add the
@benfred @jperez999 @karlhigley I wanted to check in today: would you have any help, advice, or guidance on my comment above?
@gabrielspmoreira I wanted to check in with you today: would you have any guidance or advice on my comment above?
Hey @lesnikow. Some time ago I created a preliminary version of the code that converts the internal NVTabular representation of sparse features (values, offsets) to sparse tensors, and @jperez999 later ported it and integrated it into the NVTabular dataloader.
To give an example, let's say a column in a parquet file has sequence/list values, with 3 rows like this
The internal representation of NVTabular (values, offsets) would be something like the following, as the offsets inform how many values we have for each row:
```
values  = [10, 20, 30, 40, 50, 60]
offsets = [2, 1, 3]
```
Then the equivalent sparse matrix can be built with values and indices like this:
```
values  = [10, 20, 30, 40, 50, 60]
indices = [[0, 0],
           [0, 1],
           [1, 0],
           [2, 0],
           [2, 1],
           [2, 2]]
```
Finally, the sparse tensor is converted to a dense tensor in this line, which is padded on the right. In this example I assume
To pad the items on the left, I believe we just need to subtract the 2nd column of the indices for the sparse matrix from the
From the current implementation in NVTabular, I understand that the _get_indices() method is responsible for returning the indices for each value:
```
indices[:, 1] = seq_limit - 1 - indices[:, 1]
```
If we currently don't have tests for those data loader methods that convert the offsets representation to sparse and dense features, it would be good to create such tests, using as a test case something similar to what I have described here.
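For anyone following along, here is a minimal sketch in plain PyTorch (written for this thread, not the NVTabular dataloader code) of going from the (values, offsets) representation above to a dense tensor padded on either side. The rows [10, 20], [30], [40, 50, 60] are read off from the values and offsets in the example, and `seq_limit = 3` is an assumption:

```python
import torch

# Example data from the comment above: 3 list rows flattened into values + offsets.
values = torch.tensor([10, 20, 30, 40, 50, 60])
offsets = torch.tensor([2, 1, 3])  # number of values in each row
seq_limit = 3                      # assumed maximum sequence length

# (row, col) position of every value, with columns counted from the left.
row_ids = torch.repeat_interleave(torch.arange(len(offsets)), offsets)
row_starts = torch.cumsum(offsets, 0) - offsets
col_ids = torch.arange(len(values)) - row_starts.repeat_interleave(offsets)

def to_dense(cols):
    dense = torch.zeros(len(offsets), seq_limit, dtype=values.dtype)
    dense[row_ids, cols] = values
    return dense

right_padded = to_dense(col_ids)
# tensor([[10, 20,  0],
#         [30,  0,  0],
#         [40, 50, 60]])

# Shift each row so its values occupy the last columns; zeros end up on the left
# and the within-row order of the values is preserved.
left_cols = col_ids + (seq_limit - offsets).repeat_interleave(offsets)
left_padded = to_dense(left_cols)
# tensor([[ 0, 10, 20],
#         [ 0,  0, 30],
#         [40, 50, 60]])
```

Note that this column shift keeps the values in order inside each row; simply flipping the second index column with `seq_limit - 1 - indices[:, 1]` flushes the rows to the right but reverses them, which is what the later comments in this thread discuss.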
@gabrielspmoreira This is all very good to know, thank you. @gabrielspmoreira @jperez999 do you have a commit hash that you could point me to, so that I can review this code change that @gabrielspmoreira mentioned?
@gabrielspmoreira Would you have any insight or guidance on this? I can see in the torch dataloader where this could be implemented, based on your line reference above, but it sounded like there is some user-facing method or API for the dataloaders that you are referencing which I am having trouble finding. In addition to modifying the torch implementation, that user-facing method signature would need to be updated, and I would need to make sure the specification is captured correctly by both the existing torch and tensorflow dataloaders.
Implementation of left padding for issue #1077. This is based on a suggestion by @gabrielspmoreira. I am not sure whether this change will work completely, and it is untested due to currently failing tests on main in this part of the codebase. The motivation of this commit is to open the change up for comments, suggestions, and revisions on this issue's implementation.
Update the #1077 implementation with some useful feedback from running pre-commit and the linters. The motivation is to pass the CI checks and improve code consistency.
Implement the #1077 update with a docstring and type hinting. Note that black adds spaces in the method signature type hinting for the `padding` argument. We add a docstring for _build_sparse_tensor(), as it is being modified in this issue's implementation. The motivation for this is improved codebase readability.
Hey Adam. The user-facing class is the DataLoader. For example, in PyTorch it is the
Update tests for issue #1077. We update the test name to something more descriptive, and update the test docstring to something more informative.
Add tests for issue #1077 for the TensorFlow runtime dataloader. The motivation for this update is increased test coverage.
Update tensorflow dataloader implementation for speed optimization. This implements a suggested revision by @jperez999 for issue #1077.
@gabrielspmoreira I am wondering, is this current
instead of the desired
In particular, the rows are in reversed order. I added the
Hence, is the current
If not, do you have any ideas or approaches on how to use built-in
I have been working on doing this, but I have not yet been able to find a way. One of the current difficulties is that methods like
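For concreteness, here is my own hand-derived reading of that difference, using the example rows from earlier in the thread and assuming `seq_limit = 3` (these matrices are illustrations, not output from the current code):

```python
# Result of indices[:, 1] = seq_limit - 1 - indices[:, 1] on the example rows:
# values are flushed to the right, but reversed within each row.
flipped = [[ 0, 20, 10],
           [ 0,  0, 30],
           [60, 50, 40]]

# Left padding with the within-row order preserved:
desired = [[ 0, 10, 20],
           [ 0,  0, 30],
           [40, 50, 60]]
```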
@gabrielspmoreira Could you also clarify whether your intended padding would produce
or
In other words, is this feature for sliding each row all the way to the right side of the matrix, or for left-padding a constant amount on the left? The latter is significantly easier to implement, from what I have seen so far, than the former.
Is your feature request related to a problem? Please describe.
The PyT and TF Dataloaders support padding list (sparse) features on the right, which means that shorter list sequences are completed with 0s on the right.
For sequential recommendation, a common use case is to keep the last N user interactions, which can be done either on the preprocessing side or on the model side. The NVT Slice op supports truncating to the last N elements (by providing negative limits).
But it is also useful to be able to do additional truncation on the model side (e.g. truncating with a larger max sequence threshold in the Slice op and then tuning the best max sequence length according to model accuracy and training speed). To do such truncation on the model side, the padding needs to be applied by the Data Loader on the left side of the sequence features, so that when they are converted to dense tensors the padding 0s are placed on the left. Features could then be sliced in the model like
`feature[:, -keep_last_n:]`
without losing the sequence features of users with fewer than N interactions.

Describe the solution you'd like
Create an argument for the dataloader, `sparse_padding_side`, which by default is `right` but can be set to `left`.
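As a small, hypothetical illustration (plain PyTorch tensors, not dataloader output) of why the padding side matters for the `feature[:, -keep_last_n:]` slicing above:

```python
import torch

keep_last_n = 2

# A 2-item sequence and a 3-item sequence, padded to length 3 on each side.
right_padded = torch.tensor([[10, 20,  0],
                             [40, 50, 60]])
left_padded  = torch.tensor([[ 0, 10, 20],
                             [40, 50, 60]])

right_padded[:, -keep_last_n:]  # tensor([[20,  0], [50, 60]]) -> drops a real item, keeps a padding 0
left_padded[:, -keep_last_n:]   # tensor([[10, 20], [50, 60]]) -> keeps the last N real items
```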