[FEATURE REQUEST] Importers that consume tf or pytorch Dataset, and can produce identical Datasets after #118
Comments
This is a phenomenal idea @elistevens! Will put this on the roadmap!
@hhsecond, would you have time to put together the first draft of this for the next release? (after 0.3)
Sounds good
Hey @hhsecond, just wanted to ping you on this since it seems like an appropriate feature for 0.4. If you don't have the time, I may have some next week. Let me know if you think you'll be able to take this.
@rlizzo I am finishing up the plugin module cleanup (sorry for the delay). I am tied up till Wednesday with PyCon India and GPU stuff from CircleCI. If you think this can wait till then, I am happy to take this up right after Wednesday.
@hhsecond, I am currently working on this issue.
@hhsecond, I am working on the load method's implementation of the plugin. Will keep you updated.
Hey @GauthamPughaz, just wanted to check in and see how this is going. Do you need any assistance? Any ETA on when we might be able to see a first draft? (It doesn't need to be pretty, but it may save time if we can check out an overview of the flow and suggest any necessary changes before you get too far in development.) Thanks for volunteering to contribute this! It's a great feature which will be much appreciated!
@rlizzo, I am halfway through it. I may need some assistance in understanding the internal workings of arraysets to develop the feature better. I will consult @hhsecond on this. But I am entirely held up this week and the next. I will definitely be available after that, and I would love to work on this feature.
Ok. Thanks for the update! I think that you should definitely talk with either myself or @hhsecond before getting too far then. You shouldn't actually have to care about how Why don't we try to set up a call with the three of us sometime in the next two weeks so we can discuss further. We can coordinate times through @hhsecond if that's ok? Thanks for the hard work!
@rlizzo, thanks for the clarification. I think we can have a call by Tuesday or Wednesday next week to wrap this up.
@GauthamPughaz Let's do the call today? We would like to push this to the upcoming release (0.4), hence the hurry. Sorry.
@hhsecond, no problem. We can connect through a Slack call tomorrow after 8:30 pm IST.
Is your feature request related to a problem? Please describe.
Existing DL projects are already going to have a data pipeline. Often, these are going to result in tf or pytorch Datasets. Having to replace all of the existing mechanisms by which those datasets are created and managed with hangar-specific code is a barrier to adoption.
Describe the solution you'd like
I think that import routines should be able to consume a third-party Dataset instance, inspect it, and store the relevant data in hangar. The intent would be that hangar could then produce a functionally identical Dataset for use in training, but without having to go back to the raw data.
This would change hangar adoption best practice from "replace your data pipeline" to "just insert this mostly-transparent step in the middle." Once the project is fully committed to using hangar, then the architecture can be revisited, if needed.
It would be nice if it could also vacuum up a dir tree of .tfrecord files, but that's a little less well-defined.
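To make the round trip concrete, here is a rough sketch of the flow I have in mind. It is not hangar's API: `InMemoryStore`, `import_dataset`, and `StoredDataset` are hypothetical names, and the store is a plain dict standing in for whatever hangar would write to. The only assumption about the input is that it is a map-style dataset (it has `__len__` and `__getitem__`, and each sample is a tuple of fields), which is the protocol pytorch map-style Datasets follow.

```python
from typing import Any, Dict, Sequence, Tuple


class InMemoryStore:
    """Hypothetical stand-in for a hangar checkout: one column per sample field."""

    def __init__(self) -> None:
        self.columns: Dict[str, Dict[int, Any]] = {}

    def write(self, column: str, key: int, value: Any) -> None:
        self.columns.setdefault(column, {})[key] = value


def import_dataset(dataset: Sequence[Tuple[Any, ...]],
                   store: InMemoryStore,
                   field_names: Sequence[str]) -> None:
    """Walk a map-style dataset once and copy every field of every sample."""
    for index in range(len(dataset)):
        sample = dataset[index]
        if len(sample) != len(field_names):
            raise ValueError(
                f"sample {index} has {len(sample)} fields, "
                f"expected {len(field_names)}"
            )
        for name, value in zip(field_names, sample):
            store.write(name, index, value)


class StoredDataset:
    """Map-style dataset that reads samples back from the store."""

    def __init__(self, store: InMemoryStore, field_names: Sequence[str]) -> None:
        self._store = store
        self._field_names = tuple(field_names)

    def __len__(self) -> int:
        if not self._field_names:
            return 0
        return len(self._store.columns.get(self._field_names[0], {}))

    def __getitem__(self, index: int) -> Tuple[Any, ...]:
        if not 0 <= index < len(self):
            raise IndexError(index)
        return tuple(self._store.columns[name][index]
                     for name in self._field_names)


# Toy "existing pipeline": a list of (image, label) samples.
original = [([0.0, 1.0], 0), ([2.0, 3.0], 1)]

store = InMemoryStore()
import_dataset(original, store, ("image", "label"))
restored = StoredDataset(store, ("image", "label"))
```

The training code would then be handed `restored` instead of `original` and should not be able to tell the difference. Iterable-style datasets (including tf ones) would need a different inspection step, since they cannot be indexed.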
Describe alternatives you've considered
N/A
Additional context
N/A