[FEATURE REQUEST] Importers that consume tf or pytorch Dataset, and can produce identical Datasets after #118

Open
elistevens opened this issue Aug 25, 2019 · 13 comments
Labels
enhancement New feature or request

Comments

@elistevens

Is your feature request related to a problem? Please describe.
Existing DL projects already have a data pipeline, and that pipeline often ends in tf or pytorch Datasets. Having to replace all of the existing mechanisms that create and manage those datasets with hangar-specific code is a barrier to adoption.

Describe the solution you'd like
I think that import routines should be able to consume a third-party Dataset instance, inspect it, and store the relevant data in hangar. The intent would be that hangar could then produce a functionally identical Dataset for use in training, but without having to go back to the raw data.

This would change hangar adoption best practice from "replace your data pipeline" to "just insert this mostly-transparent step in the middle." Once the project is fully committed to using hangar, then the architecture can be revisited, if needed.
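
A minimal sketch of that import-then-replay flow, under loudly stated assumptions: a plain dict stands in for hangar storage (no hangar API is used or implied), and `SquaresDataset`, `import_dataset`, and `StoredDataset` are hypothetical names invented for illustration. It relies only on the `__len__`/`__getitem__` protocol that PyTorch map-style Datasets follow.

```python
class SquaresDataset:
    """Toy stand-in for an existing pipeline whose samples are derived from raw data."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        # (features, label) pair, as a typical supervised dataset would return.
        return ([float(index), float(index * index)], index % 2)


def import_dataset(dataset, store):
    """Copy every sample of a map-style dataset into `store`.

    `store` is a plain dict here; it is an assumption standing in for
    whatever versioned storage the importer would really write to.
    """
    for index in range(len(dataset)):
        store[str(index)] = dataset[index]
    return len(dataset)


class StoredDataset:
    """Replays imported samples through the same __len__/__getitem__ interface."""

    def __init__(self, store):
        self._store = store

    def __len__(self):
        return len(self._store)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        return self._store[str(index)]


original = SquaresDataset(4)
store = {}
imported_count = import_dataset(original, store)
replayed = StoredDataset(store)  # no access to `original` is needed from here on
```

Because `replayed` exposes the same two methods as `original`, training code that only indexes and measures the dataset should not need to change. Iterable-style datasets (such as `tf.data.Dataset`) would need a different traversal than the index loop shown here.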

It would be nice if it could also vacuum up a dir tree of .tfrecord files, but that's a little less well-defined.
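
For the directory-tree case, discovery is the only well-defined part. The sketch below finds `.tfrecord` files with the standard library and groups them by directory. `find_tfrecord_files` is a hypothetical helper name, and decoding the records is deliberately left out, since that depends on each project's schema.

```python
from pathlib import Path


def find_tfrecord_files(root):
    """Group .tfrecord file names under `root` by parent directory (relative to `root`)."""
    root_path = Path(root)
    groups = {}
    for path in sorted(root_path.rglob("*.tfrecord")):
        if path.is_file():
            # Files directly inside `root` are grouped under the key ".".
            key = path.parent.relative_to(root_path).as_posix()
            groups.setdefault(key, []).append(path.name)
    return groups
```

Each directory key could then map to one logical group in the store, though that mapping is a design choice this issue leaves open.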

Describe alternatives you've considered
N/A

Additional context
N/A

@elistevens elistevens added the enhancement New feature or request label Aug 25, 2019
@rlizzo
Member

rlizzo commented Sep 4, 2019

This is a phenomenal idea @elistevens! Will put this on the roadmap!

@rlizzo
Member

rlizzo commented Sep 5, 2019

@hhsecond, would you have time to put together the first draft of this for the next release? (after 0.3)

@hhsecond
Member

hhsecond commented Sep 6, 2019

Sounds good

@hhsecond hhsecond self-assigned this Sep 6, 2019
@rlizzo
Member

rlizzo commented Oct 9, 2019

Hey @hhsecond, just wanted to ping you on this since it seems like an appropriate feature for 0.4. If you don't have the time, I may have some next week. Let me know if you think you'll be able to take this.

@hhsecond
Member

hhsecond commented Oct 9, 2019

@rlizzo I am finishing up the plugin module cleanup (sorry for the delay). I am tied up until Wednesday with PyCon India and GPU stuff from CircleCI. If you think this can wait until then, I am happy to take this up right after Wednesday.

@gauthampughazhendhi

@hhsecond, I am currently working on this issue.

@gauthampughazhendhi

gauthampughazhendhi commented Oct 15, 2019

@hhsecond, I am working on the implementation of the plugin's load method. Will keep you updated.

@rlizzo
Member

rlizzo commented Oct 21, 2019

Hey @GauthamPughaz, just wanted to check in and see how this is going. Do you need any assistance? Is there an ETA for when we might be able to see a first draft? It doesn't need to be pretty, but it may save time if we can check out an overview of the flow and suggest any necessary changes before you get too far into development.

Thanks for volunteering to contribute this! It's a great feature which will be much appreciated!

@gauthampughazhendhi

@rlizzo, I am halfway through it. I may need some assistance in understanding the internal workings of arraysets to develop the feature better. I will consult @hhsecond about this. However, I am entirely held up this week and the next. I will definitely be available after that, and I would love to work on this feature.

@rlizzo
Member

rlizzo commented Oct 23, 2019

Ok. Thanks for the update! I think that you should definitely talk with either me or @hhsecond before getting too far.

You shouldn't actually have to care about how arraysets work internally to develop this feature. In general, we don't allow any access to the internal workings of the arraysets outside of the public API, even for internal hangar operations. This is because the actual data reader/writer backend methods are heavily protected by weakref proxies and context managers to ensure that all operations occur safely. Going outside of these protections could open up some nasty bugs/behavior if not done properly.
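
As a rough illustration of that general pattern (not hangar's actual code), the sketch below hands callers a `weakref.proxy` to a backend object, and only from inside a context manager. `_Backend`, `Store`, and `session` are hypothetical names made up for this example.

```python
import weakref
from contextlib import contextmanager


class _Backend:
    """Hypothetical low-level reader/writer; a stand-in, not hangar code."""

    def __init__(self):
        self.is_open = False
        self._data = {}

    def write(self, key, value):
        if not self.is_open:
            raise RuntimeError("backend is closed")
        self._data[key] = value

    def read(self, key):
        if not self.is_open:
            raise RuntimeError("backend is closed")
        return self._data[key]


class Store:
    """Public wrapper that holds the only strong reference to the backend."""

    def __init__(self):
        self._backend = _Backend()

    @contextmanager
    def session(self):
        # Open the backend only for the duration of the `with` block.
        self._backend.is_open = True
        try:
            # Callers get a weak proxy, so they cannot keep the backend alive.
            yield weakref.proxy(self._backend)
        finally:
            self._backend.is_open = False

    def close(self):
        # Dropping the strong reference invalidates outstanding proxies.
        self._backend = None


store = Store()
with store.session() as handle:
    handle.write("sample-0", [1, 2, 3])
    value = handle.read("sample-0")
```

Using `handle` after the `with` block raises `RuntimeError` because the backend is marked closed, and after `store.close()` the proxy itself goes stale (a `ReferenceError` in CPython, where the backend is freed as soon as its last strong reference is dropped). Code that reaches around the wrapper to hold the backend directly would lose both safeguards.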

Why don't we try to set up a call with the three of us sometime in the next two weeks so we can discuss further? We can coordinate times through @hhsecond, if that's ok.

Thanks for the hard work!
Rick

@gauthampughazhendhi

@rlizzo, thanks for the clarification. I think we can have a call by Tuesday or Wednesday next week to wrap this up.

@hhsecond
Member

@GauthamPughaz Could we do the call today? We would like to get this into the upcoming release (0.4), hence the hurry. Sorry.

@gauthampughazhendhi

@hhsecond, no problem. We can connect through a Slack call tomorrow after 8:30 pm IST.
