
Add support of HDFS as remote object store #1062

Closed · wants to merge 3 commits

Conversation

yahoNanJing (Contributor):

Which issue does this PR close?

Closes #1060.

Rationale for this change

Currently, we can only read parquet files from the local file system. It would be nice to add support for reading parquet files that reside on HDFS.

What changes are included in this PR?

Introduce LocalParquetFileReader to get a parquet ChunkReader for reading parquet files from the local file system.
Introduce HadoopParquetFileReader to get a parquet ChunkReader for reading parquet files from HDFS.

Are there any user-facing changes?

Users can register a parquet table with an HDFS URI, like "hdfs://localhost:9000/tmp/alltypes_plain.parquet".
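For illustration, the user-facing flow might look like the following. This is a hedged sketch: register_parquet and sql are real ExecutionContext methods, but whether they were async at this point in DataFusion's history, and the exact imports, are assumptions.

use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let mut ctx = ExecutionContext::new();
    // With the `hdfs` feature enabled, the hdfs:// scheme should route
    // reads through the HDFS object store instead of the local file system.
    ctx.register_parquet("alltypes", "hdfs://localhost:9000/tmp/alltypes_plain.parquet")
        .await?;
    let df = ctx.sql("SELECT count(*) FROM alltypes").await?;
    df.show().await?;
    Ok(())
}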

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Sep 28, 2021
@houqp houqp requested a review from alamb September 29, 2021 03:48
@@ -75,8 +83,15 @@ pub type ListEntryStream =
/// It maps strings (e.g. URLs, filesystem paths, etc) to sources of bytes
#[async_trait]
pub trait ObjectStore: Sync + Send + Debug {
/// Get file system scheme
fn get_schema(&self) -> &'static str;
Member:

An object store could have multiple schemes (for example, s3/s3a or file/fs/filesystem), so it would be better to return a slice of str here.
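For concreteness, a minimal sketch of the suggested signature (the name get_schemes and the exact return type are illustrations of the suggestion, not code from this PR):

/// URI schemes this store can handle, e.g. ["s3", "s3a"] or
/// ["file", "fs", "filesystem"].
fn get_schemes(&self) -> &'static [&'static str];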

Member:

Also, I think the name should be get_scheme instead?

houqp (Member), Sep 29, 2021:

Hmm... after taking a closer look at this, it looks like this is mainly used in get_chunk_reader to build object-store-specific ChunkReaders based on the file scheme. I think the ideal abstraction would be to make file format modules agnostic to object stores instead of implementing object-store-specific format readers like HadoopParquetFileReader.

Contributor:

I think the ideal abstraction would be to make file format modules agnostic to object stores instead of implementing object store specific format readers like HadoopParquetFileReader

I agree. @rdettai is heading in this direction in #1010

rdettai (Contributor) commented Oct 6, 2021:

Wow, interesting work, thanks! 😃 I am wondering about the order in which we should proceed. Shouldn't we make the providers/execution_plans use the object store abstraction first, and then add new store implementations? Or the other way around? I see that rdettai#1 and this PR are both making major changes to /object_store/mod.rs 😄 Smells like annoying conflicts!

alamb (Contributor) commented Oct 6, 2021:

Shouldn't we make the providers/execution_plans use the object store abstraction first, and then add new store implementations?

I don't think either order is better or worse than the other. The real challenge, as @rdettai says, is when the interfaces are changing in multiple PRs, resulting in conflicts...

alamb (Contributor) left a review comment:

Thank you for this PR @yahoNanJing -- this is a great start. I haven't reviewed the code carefully, but the basic idea looks wonderful to me

How do you think we should proceed with parallel implementations with #1010 ? It seems like the actual hdfs bindings / object store implementation for hdfs are valuable, but the changes to the parquet reader are likely going to conflict

@@ -46,6 +46,8 @@ unicode_expressions = ["unicode-segmentation"]
force_hash_collisions = []
# Used to enable the avro format
avro = ["avro-rs", "num-traits"]
# Used to enable hdfs as remote object store
hdfs = ["fs-hdfs"]
Contributor:

👍
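For context, downstream users would opt in at build time in the usual Cargo way; only the feature name below comes from the diff above:

cargo build --features hdfs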

// specific language governing permissions and limitations
// under the License.

//! A utility for testing with a data source in HDFS.
Contributor:

👍 very nice for testing

Comment on lines +111 to +114
lazy_static! {
static ref OBJECT_STORES: Box<ObjectStoreRegistry> =
Box::new(ObjectStoreRegistry::new());
}
Contributor:

In #1072 the consensus seems to be that we should avoid using a static for the registry.
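A minimal sketch of the instance-based alternative, assuming this PR's ObjectStoreRegistry type; the register_store signature and the HadoopFileSystem constructor shown here are assumptions for illustration:

use std::sync::Arc;

// Owned by the execution context rather than a process-wide static, so
// tests and embedders can each configure their own set of stores.
let registry = ObjectStoreRegistry::new();
registry.register_store("hdfs".to_owned(), Arc::new(HadoopFileSystem::default()));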


/// Parquet file reader for hdfs object store, by which we can get ``DynChunkReader``
#[derive(Clone)]
pub struct HadoopParquetFileReader {

Comment:

(declaration: I am a newbie to this project, I may ask dumb questions)

Isn't the object store one layer lower than the file format? It should be transparent to a particular format, right?
Otherwise, if we have N types of object stores (local, hdfs, s3) and M types of file formats (csv, parquet, json), we will have N*M concrete implementations.

Member:

Yes, our current implementation of file formats works with a generic ObjectStore trait object, for example: https://github.com/apache/arrow-datafusion/blob/831e07debc4f136f2e47e126b20e441f7606bd74/datafusion/src/datasource/file_format/csv.rs#L98
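Sketching that pattern (a loose sketch modeled on the linked csv.rs; the ObjectStore, ObjectReader, and SizedFile names follow DataFusion's object_store module of the time, but the exact signatures are assumptions):

use std::sync::Arc;
use datafusion::datasource::object_store::{ObjectReader, ObjectStore, SizedFile};
use datafusion::error::Result;

/// Backend-agnostic: the same call opens readers whether the files live on
/// local disk, HDFS, or S3, so no per-store format reader is needed.
fn open_all(
    store: Arc<dyn ObjectStore>,
    files: &[SizedFile],
) -> Result<Vec<Arc<dyn ObjectReader>>> {
    files.iter().map(|f| store.file_reader(f.clone())).collect()
}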

coderplay, Oct 16, 2021:

@houqp Then why do we need this HadoopParquetFileReader, plus the LocalParquetFileReader below?

Member:

We shouldn't have it; see my comment in #1062 (comment) :)

yahoNanJing (Author):

Thank you for this PR @yahoNanJing -- this is a great start. I haven't reviewed the code carefully, but the basic idea looks wonderful to me

How do you think we should proceed with parallel implementations with #1010 ? It seems like the actual hdfs bindings / object store implementation for hdfs are valuable, but the changes to the parquet reader are likely going to conflict

Thanks @alamb and @houqp for your comments. I'll do a refactoring based on the latest code merged with #1010. Sorry for the late reply; I was asked to deal with other things 😂

Contributor:

I totally understand 👍

yahoNanJing (Author):

Hi @alamb, I've done the refactoring. Where should I create the PR? Should I directly overwrite this PR or create a new one? Also, per the discussion in #907, the community may prefer to put connectors such as S3 and HDFS in their own repositories for fast development iterations. Should I create another repository for this HDFS support?

alamb (Contributor):

Hi @yahoNanJing -- if you have time, here is what I recommend:

  1. Make a draft PR to this (arrow-datafusion) repo (so the conversation can include the code)
  2. Leave a note in the PR's description that you are looking for feedback about where to put the connector (this repo or a separate one)
  3. Then send a note to the [email protected] mailing list (or I can do this too) with a reference to the PR asking if anyone has feedback.

I have my own opinions on this matter, but I think we should get broader input before making a decision

yahoNanJing (Author):

Hi @alamb, a new PR #1223 has been created as a refactor of this PR.

Actually, in my opinion, it's better for DataFusion to support at least one remote object store by default in its own repository. Many companies around me still use HDFS, so it would be good to add HDFS support by default.

yahoNanJing (Author):

Also, the object store interfaces are still not stable. Adding one remote object store implementation will also be helpful for the interface refactoring.

alamb (Contributor) commented Dec 29, 2021:

I am trying to clean up the list of PRs to review in DataFusion, so I am marking old ones as stale. Please let us know if you plan to work on this soon; otherwise we will close it and reopen it when the time is right.

@alamb alamb added the stale-pr label Dec 29, 2021
@alamb alamb mentioned this pull request Jan 4, 2022
alamb (Contributor) commented Jan 18, 2022:

Closing stale PRs; please reopen if this was a mistake and you plan to keep working on this one.

@alamb alamb closed this Jan 18, 2022
alamb (Contributor) commented Jan 18, 2022:

I think development has moved to https://github.com/datafusion-contrib/datafusion-hdfs-native
