
Add support of HDFS as remote object store #1062

Closed · wants to merge 3 commits

Conversation

yahoNanJing (Contributor):

Which issue does this PR close?

Closes #1060.

Rationale for this change

Currently, we can only read parquet files from the local file system. It would be nice to add support for reading parquet files that reside on HDFS.

What changes are included in this PR?

Introduce LocalParquetFileReader to get a parquet ChunkReader for reading parquet files from the local file system.
Introduce HadoopParquetFileReader to get a parquet ChunkReader for reading parquet files from HDFS.

Are there any user-facing changes?

Users can register a parquet table with an HDFS URI, like "hdfs://localhost:9000/tmp/alltypes_plain.parquet".
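For illustration, the user-facing flow might look like the following. This is a hedged sketch: register_parquet and sql are real ExecutionContext methods, but whether they were async at this point in DataFusion's history, and the exact imports, are assumptions.

use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let mut ctx = ExecutionContext::new();
    // With the `hdfs` feature enabled, the hdfs:// scheme should route
    // reads through the HDFS object store instead of the local file system.
    ctx.register_parquet("alltypes", "hdfs://localhost:9000/tmp/alltypes_plain.parquet")
        .await?;
    let df = ctx.sql("SELECT count(*) FROM alltypes").await?;
    df.show().await?;
    Ok(())
}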

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Sep 28, 2021
@houqp houqp requested a review from alamb September 29, 2021 03:48
@@ -75,8 +83,15 @@ pub type ListEntryStream =
/// It maps strings (e.g. URLs, filesystem paths, etc) to sources of bytes
#[async_trait]
pub trait ObjectStore: Sync + Send + Debug {
/// Get file system scheme
fn get_schema(&self) -> &'static str;
Member:

An object store could have multiple schemes (for example, s3/s3a or file/fs/filesystem), so it would be better to return a slice of str here.
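For concreteness, a minimal sketch of the suggested signature (the name get_schemes and the exact return type are illustrations of the suggestion, not code from this PR):

/// URI schemes this store can handle, e.g. ["s3", "s3a"] or
/// ["file", "fs", "filesystem"].
fn get_schemes(&self) -> &'static [&'static str];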

Member:

Also, I think the name should be get_scheme instead?

houqp (Member), Sep 29, 2021:

Hmm... after taking a closer look at this, it looks like this is mainly used in get_chunk_reader to build object-store-specific ChunkReaders based on the file scheme. I think the ideal abstraction would be to make file format modules agnostic to object stores instead of implementing object-store-specific format readers like HadoopParquetFileReader.

Contributor:

I think the ideal abstraction would be to make file format modules agnostic to object stores instead of implementing object store specific format readers like HadoopParquetFileReader

I agree. @rdettai is heading in this direction in #1010

rdettai (Contributor) commented Oct 6, 2021:

Wow, interesting work, thanks! 😃 I am wondering about the order in which we should proceed. Shouldn't we make the providers/execution_plans use the object store abstraction first, and then add new store implementations? Or the other way around? I see that rdettai#1 and this PR are both making major changes to /object_store/mod.rs 😄 Smells like annoying conflicts!

alamb (Contributor) commented Oct 6, 2021:

Shouldn't we make the providers/execution_plans use the object store abstraction first, and then add new store implementations?

I don't think either order is better or worse than the other. The real challenge, as @rdettai says, is when the interfaces are changing in multiple PRs, resulting in conflicts...

alamb (Contributor) left a review comment:

Thank you for this PR @yahoNanJing -- this is a great start. I haven't reviewed the code carefully, but the basic idea looks wonderful to me

How do you think we should proceed with parallel implementations with #1010 ? It seems like the actual hdfs bindings / object store implementation for hdfs are valuable, but the changes to the parquet reader are likely going to conflict

@@ -46,6 +46,8 @@ unicode_expressions = ["unicode-segmentation"]
force_hash_collisions = []
# Used to enable the avro format
avro = ["avro-rs", "num-traits"]
# Used to enable hdfs as remote object store
hdfs = ["fs-hdfs"]
Contributor:

👍
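For context, downstream users would opt in at build time in the usual Cargo way; only the feature name below comes from the diff above:

cargo build --features hdfs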

// specific language governing permissions and limitations
// under the License.

//! A utility for testing with a data source in HDFS.
Contributor:

👍 very nice for testing

Comment on lines +111 to +114
lazy_static! {
static ref OBJECT_STORES: Box<ObjectStoreRegistry> =
Box::new(ObjectStoreRegistry::new());
}
Contributor:

In #1072 the consensus seems to be that we should avoid using a static for the registry.
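A minimal sketch of the instance-based alternative, assuming this PR's ObjectStoreRegistry type; the register_store signature and the HadoopFileSystem constructor shown here are assumptions for illustration:

use std::sync::Arc;

// Owned by the execution context rather than a process-wide static, so
// tests and embedders can each configure their own set of stores.
let registry = ObjectStoreRegistry::new();
registry.register_store("hdfs".to_owned(), Arc::new(HadoopFileSystem::default()));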


/// Parquet file reader for hdfs object store, by which we can get ``DynChunkReader``
#[derive(Clone)]
pub struct HadoopParquetFileReader {

Comment:

(declaration: I am a newbie to this project, I may ask dumb questions)

Isn't the object store one layer lower than the file format? It should be transparent to a particular format, right?
Otherwise, if we have N types of object stores (local, hdfs, s3) and M types of file formats (csv, parquet, json), we will have N*M concrete implementations.

Member:

Yes, our current implementation of file formats works with a generic ObjectStore trait object, for example: https://github.com/apache/arrow-datafusion/blob/831e07debc4f136f2e47e126b20e441f7606bd74/datafusion/src/datasource/file_format/csv.rs#L98
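Sketching that pattern (a loose sketch modeled on the linked csv.rs; the ObjectStore, ObjectReader, and SizedFile names follow DataFusion's object_store module of the time, but the exact signatures are assumptions):

use std::sync::Arc;
use datafusion::datasource::object_store::{ObjectReader, ObjectStore, SizedFile};
use datafusion::error::Result;

/// Backend-agnostic: the same call opens readers whether the files live on
/// local disk, HDFS, or S3, so no per-store format reader is needed.
fn open_all(
    store: Arc<dyn ObjectStore>,
    files: &[SizedFile],
) -> Result<Vec<Arc<dyn ObjectReader>>> {
    files.iter().map(|f| store.file_reader(f.clone())).collect()
}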

coderplay, Oct 16, 2021:

@houqp Then why do we need this HadoopParquetFileReader, plus the LocalParquetFileReader below?

Member:

We shouldn't have it; see my comment in #1062 (comment) :)

yahoNanJing (Author):

Thank you for this PR @yahoNanJing -- this is a great start. I haven't reviewed the code carefully, but the basic idea looks wonderful to me

How do you think we should proceed with parallel implementations with #1010 ? It seems like the actual hdfs bindings / object store implementation for hdfs are valuable, but the changes to the parquet reader are likely going to conflict

Thanks @alamb and @houqp for your comments. I'll do a refactoring based on the latest code merged with #1010. Sorry for the late reply; I was asked to deal with other things 😂

Contributor:

I totally understand 👍

yahoNanJing (Author):

Hi @alamb, I've done the refactoring. Where should I create the PR? Should I directly overwrite this PR or create a new one? Also, per the discussion in #907, the community may prefer to put connectors such as S3 and HDFS in their own repositories for fast development iterations. Should I create another repository for this HDFS support?

alamb (Contributor):

Hi @yahoNanJing -- if you have time, here is what I recommend:

  1. Make a draft PR to this (arrow-datafusion) repo (so the conversation can include the code)
  2. Leave a note in the PR's description that you are looking for feedback about where to put the connector (this repo or a separate one)
  3. Then send a note to the [email protected] mailing list (or I can do this too) with a reference to the PR asking if anyone has feedback.

I have my own opinions on this matter, but I think we should get broader input before making a decision

yahoNanJing (Author):

Hi @alamb, a new PR #1223 has been created as a refactor of this PR.

Actually, in my opinion, it's better for DataFusion to support at least one remote object store by default in its own repository. Many companies around me still use HDFS, so it would be good to add HDFS support by default.

yahoNanJing (Author):

Also, the object store interfaces are still not stable. Adding one remote object store implementation will also be helpful for the interface refactoring.

alamb (Contributor) commented Dec 29, 2021:

I am trying to clean up the list of PRs to review in DataFusion, so I am marking old ones as stale. Please let us know if you plan to work on this soon; otherwise we will close it and reopen it when the time is right.

@alamb alamb added the stale-pr label Dec 29, 2021
@alamb alamb mentioned this pull request Jan 4, 2022
alamb (Contributor) commented Jan 18, 2022:

Closing stale PRs; please reopen if this was a mistake and you plan to keep working on this one.

@alamb alamb closed this Jan 18, 2022
alamb (Contributor) commented Jan 18, 2022:

I think development has moved to https://github.com/datafusion-contrib/datafusion-hdfs-native
