hdfs support #300

Closed
mingruimingrui opened this issue Jul 1, 2021 · 15 comments
Labels
binding/rust (Issues for the Rust crate), enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

mingruimingrui commented Jul 1, 2021

Description

HDFS storage support.

Use Case
A significant portion of companies dealing with big data use HDFS as their backend of choice for long-term persistent data storage and processing. Having this would be very beneficial for the place I currently work at.

mingruimingrui added the enhancement label Jul 1, 2021
houqp added the binding/rust and help wanted labels Jul 2, 2021
Contributor

zijie0 commented Dec 1, 2021

@houqp I searched for HDFS libraries for Rust and found this crate: https://crates.io/crates/fs-hdfs . But it seems to need a lot of dependencies to run. Do you have any suggestions on this?

Member

houqp commented Dec 2, 2021

@yjshen wrote a wrapper for libhdfs3: https://github.com/datafusion-contrib/datafusion-hdfs-native. This one is a lot leaner since it only has a C++ dependency. Perhaps you can work with him to convert that binding into its own crate? Right now it's coupled with the DataFusion HDFS object store implementation.

Contributor

zijie0 commented Dec 2, 2021

Good to know that. @yjshen do you have time to convert your work into a new crate? It would be very helpful for other Rust projects too.

Contributor

yjshen commented Dec 3, 2021

@houqp Do you think datafusion-contrib is the right place to host this HDFS Rust repo? Or should I create it under my own account?

Member

houqp commented Dec 3, 2021

@yjshen up to you, since you are the author :)

Contributor

zijie0 commented Dec 17, 2021

Hey @yjshen, any update on this? Currently we are using MinIO on HDFS as a workaround, but that does not seem to be sustainable: minio/minio#13927 . We are all counting on you now :)

Contributor

yjshen commented Dec 17, 2021

Contributor

zijie0 commented Dec 17, 2021

Cool! @yjshen

Author

mingruimingrui commented Jan 19, 2022

Sorry everyone, I realized that this feature might not be trivial to support.
HDFS and the Apache stack can be very complicated to support. Ideally, users connecting to HDFS should use the client version intended by the cluster maintainers. This often means using the version of HDFS provided on the system they are running on. So to support this feature, I think the options are as follows:

  1. Do dynamic linking to existing system libraries.
  2. Bundle the HDFS dependencies and publish multiple versions of delta-rs, one per HDFS version.
  3. Add dependency to wrapper library like hdfs-native by @yjshen.

But this will open up a whole new can of worms that I don't think is good for any project to experience. Not to mention that these approaches will increase installation complexity for end users (most of whom probably would not be too experienced with this).

A workaround is to mount HDFS, access it like a regular filesystem, and let delta-rs reach HDFS that way, though this is just a suggestion.
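
To make the mount workaround concrete, here is a minimal sketch that assumes HDFS is already exposed at a local path (for example through a FUSE or NFS gateway mount). With that in place, plain `std::fs` calls are enough to read a table's transaction log. The function name `list_commit_files` is hypothetical and not part of delta-rs. The sketch covers reads only. Whether commits are safe over such a mount depends on the rename semantics the mount provides.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Lists Delta commit files (`<version>.json`) under `<table_root>/_delta_log`.
/// `table_root` is assumed to be a path on a locally mounted HDFS
/// (the mount itself is outside the scope of this sketch).
/// Commit file names are zero-padded, so sorting by path gives version order.
fn list_commit_files(table_root: &Path) -> io::Result<Vec<PathBuf>> {
    let log_dir = table_root.join("_delta_log");
    let mut commits = Vec::new();
    for entry in fs::read_dir(&log_dir)? {
        let path = entry?.path();
        // Keep only JSON commit files; skip checkpoints and `_last_checkpoint`.
        let is_json = path.extension().map_or(false, |ext| ext == "json");
        if is_json {
            commits.push(path);
        }
    }
    commits.sort();
    Ok(commits)
}
```

Calling `list_commit_files(Path::new("/mnt/hdfs/warehouse/my_table"))` (a hypothetical mount point) would return the commit files in version order, which is the starting point for replaying the log.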

Author

mingruimingrui commented Jan 19, 2022

I find the solution by @yjshen to be a really sound one, but installation will likely differ across systems.
E.g. the company I work at has a custom HDFS version, so I'll have to make sure to build hdfs-native correctly (and not follow the instructions in the repo).
The barrier to adoption might be high.

Contributor

yjshen commented Jan 19, 2022

Hi @mingruimingrui, I've run into a similar problem: a customized HDFS version like yours. To make it worse, we even use HDFS with federation, which isn't supported by the native C++ implementations.

My motivation for implementing hdfs-native as well as datafusion-hdfs was to call DataFusion through JNI in Spark executors to boost performance. Since our customized HDFS only maintains its Java client, we currently take another approach as a workaround: create an HDFS client in the JVM and share it through JNI.

Member

houqp commented Jan 21, 2022

> E.g. the company I work at has a custom hdfs version. So I'll have to make sure to build hdfs-native correctly (and not follow the instructions in the repo).
> The barrier of adoption might be high.

Yeah, unfortunately for a custom setup, a custom build will be needed for native applications. I am guessing ClickHouse has the same problem as well.

Contributor

yjshen commented Jan 21, 2022

Yes, that is true for ClickHouse. For now, our hosted ClickHouse cluster can only use a single HDFS NameNode. It lacks the ability to use federated HDFS.

@akizminet

DataFusion now has an HDFS object store implementation: https://github.com/datafusion-contrib/datafusion-objectstore-hdfs. Does that help delta-rs support HDFS?

@wjones127
Collaborator

Yes. It looks like they have complete read support, but write support is incomplete.

Someone could integrate that into this package.
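
As a rough sketch of one piece such an integration would probably need: a table URI has to be routed to an HDFS-capable backend based on its scheme. The `BackendKind` enum and `backend_for_uri` function below are hypothetical names used only for illustration. They are not delta-rs or datafusion-objectstore-hdfs APIs, and the set of schemes is an assumption.

```rust
/// Hypothetical storage backend kinds (illustration only, not delta-rs types).
#[derive(Debug, PartialEq)]
enum BackendKind {
    LocalFile,
    S3,
    Hdfs,
}

/// Picks a backend from the URI scheme.
/// A URI without a scheme is treated as a local path (an assumption of this sketch).
/// Unknown schemes return `None` so the caller can report an unsupported backend.
fn backend_for_uri(uri: &str) -> Option<BackendKind> {
    match uri.split_once("://") {
        Some(("hdfs", _)) => Some(BackendKind::Hdfs),
        Some(("s3", _)) => Some(BackendKind::S3),
        Some(("file", _)) => Some(BackendKind::LocalFile),
        Some(_) => None,
        None => Some(BackendKind::LocalFile),
    }
}
```

Keeping the HDFS arm behind a build-time feature would let users without HDFS client libraries avoid the extra installation steps discussed above.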

6 participants