[SUPPORT] Should we introduce extensible simple bucket index ? #12210

Open
TheR1sing3un opened this issue Nov 5, 2024 · 8 comments
Labels: feature-enquiry, index

Comments

@TheR1sing3un (Member)

Currently we have two types of bucket engines:

  • Simple: fixed bucket number
  • Consistent hashing: adjusts the bucket number via a consistent-hash mapping

When a user builds a table with the simple bucket index, the amount of data in each bucket can grow over time, degrading compaction performance. However, users have no way to adjust the number of buckets in real time other than deleting and rebuilding the table.
So do we need to provide a way to dynamically adjust the bucket count?
Bucket resizing should not block reading and writing. We can use clustering to adjust the number of buckets, and dual-write new records during the resize to ensure data consistency (a sketch of the dual-write routing follows below).
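
A minimal sketch of the dual-write idea, assuming the simple bucket rule hash(recordKey) mod numBuckets; the class and method names here are hypothetical, not Hudi APIs:

```java
// While a resize is in flight, route each record to both the old and the new
// bucket layout, so readers of either layout stay consistent until clustering
// finishes rewriting the table.
import java.util.HashSet;
import java.util.Set;

public class DualWriteBucketRouter {
  private final int oldNumBuckets;
  private final int newNumBuckets;
  private volatile boolean resizing = true;

  public DualWriteBucketRouter(int oldNumBuckets, int newNumBuckets) {
    this.oldNumBuckets = oldNumBuckets;
    this.newNumBuckets = newNumBuckets;
  }

  // Simple bucket index rule: hash of the record key modulo the bucket count.
  private static int bucketId(String recordKey, int numBuckets) {
    return Math.floorMod(recordKey.hashCode(), numBuckets);
  }

  // All buckets the record must be written to; two entries while resizing.
  public Set<Integer> targetBuckets(String recordKey) {
    Set<Integer> targets = new HashSet<>();
    if (resizing) {
      targets.add(bucketId(recordKey, oldNumBuckets)); // old layout, still readable
      targets.add(bucketId(recordKey, newNumBuckets)); // new layout (dual write)
    } else {
      targets.add(bucketId(recordKey, newNumBuckets));
    }
    return targets;
  }

  // Called once clustering has rewritten all existing data into the new layout.
  public void finishResize() {
    this.resizing = false;
  }
}
```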

Why don't we use the consistent-hash bucket engine?

  • The number of buckets can only be increased or decreased one by one
  • It is difficult to provide a bucket join to the query engine because of the complex mapping between hash values and buckets

So should we introduce an extensible simple bucket index? Welcome to the discussion!


@TheR1sing3un (Member Author)

@danny0405 Hi! I have some ideas about the extensible simple bucket. Looking forward to your reply!

@github-project-automation github-project-automation bot moved this to ⏳ Awaiting Triage in Hudi Issue Support Nov 5, 2024
@ad1happy2go ad1happy2go added the feature-enquiry label Nov 5, 2024
@danny0405 (Contributor)

> The number of buckets can only be increased or decreased one by one

I don't think this is a real problem.

> It is difficult to provide a bucket join to the query engine because of the complex mapping between hash values and buckets

Bucket join relies on engine-specific naming of the files, which is very probably different from the Hudi style; I don't think it is easy to generalize.

@TheR1sing3un (Member Author)

> Bucket join relies on engine-specific naming of the files, which is very probably different from the Hudi style; I don't think it is easy to generalize.

IMO, the query engine only needs to know the number of buckets to distribute records to the right partition (bucket) according to the hash value. The consistent-hash bucket index cannot do this, because there is no fixed rule relating the bucket count to the hash-to-bucket mapping. A sketch of this bucket-count-only join is below.
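
A minimal illustration of the point, assuming only the simple rule hash(key) mod numBuckets; this is a toy example, not an actual engine integration:

```java
// With a simple bucket index, an engine that knows nothing but numBuckets can
// co-partition both sides of a join and match bucket i against bucket i only,
// avoiding a shuffle.
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SimpleBucketJoin {
  static int bucketId(String key, int numBuckets) {
    return Math.floorMod(key.hashCode(), numBuckets);
  }

  // Group one table's rows by bucket id; the same rule applies to both sides.
  static Map<Integer, List<String>> byBucket(List<String> keys, int numBuckets) {
    return keys.stream().collect(Collectors.groupingBy(k -> bucketId(k, numBuckets)));
  }

  public static void main(String[] args) {
    int numBuckets = 4; // the only metadata the engine needs from the table
    Map<Integer, List<String>> left = byBucket(List.of("a", "b", "c"), numBuckets);
    Map<Integer, List<String>> right = byBucket(List.of("b", "c", "d"), numBuckets);
    // Bucket i on the left joins only against bucket i on the right.
    left.forEach((bucket, keys) ->
        System.out.println("bucket " + bucket + ": " + keys + " vs " + right.get(bucket)));
  }
}
```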

@danny0405 (Contributor)

> IMO, the query engine only needs to know the number of buckets to distribute records to the right partition (bucket) according to the hash value.

This is very engine specific, because the "bucket" definition varies among different engines.

@TheR1sing3un (Member Author)

> IMO, the query engine only needs to know the number of buckets to distribute records to the right partition (bucket) according to the hash value.

> This is very engine specific, because the "bucket" definition varies among different engines.

Got it~

@danny0405 (Contributor)

Also, the main gain of consistent hashing is that re-hashing tries to rewrite as few data files as possible. Otherwise you have to rewrite the whole existing data set (the entire table), a cost that is unacceptable in many cases.
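
A minimal sketch of why consistent hashing limits the rewrite, assuming a range-per-bucket ring; the names and ring layout here are illustrative, not Hudi's actual implementation:

```java
// Each bucket owns a contiguous hash range on a ring. Splitting one overloaded
// bucket only rewrites that bucket's files; every other bucket keeps its range
// and its data files untouched.
import java.util.Map;
import java.util.TreeMap;

public class ConsistentHashRing {
  // Upper bound of a hash range -> bucket id.
  private final TreeMap<Integer, Integer> ring = new TreeMap<>();
  private int nextBucketId;

  public ConsistentHashRing(int initialBuckets, int hashSpace) {
    for (int i = 1; i <= initialBuckets; i++) {
      ring.put(i * (hashSpace / initialBuckets), nextBucketId++);
    }
  }

  public int bucketFor(int hash) {
    Map.Entry<Integer, Integer> owner = ring.ceilingEntry(hash);
    return owner != null ? owner.getValue() : ring.firstEntry().getValue();
  }

  // Split the bucket owning the range that ends at upperBound; only the records
  // of that single bucket need to be re-hashed and rewritten.
  public void split(int upperBound) {
    Integer lowerBound = ring.lowerKey(upperBound);
    int mid = ((lowerBound == null ? 0 : lowerBound) + upperBound) / 2;
    ring.put(mid, nextBucketId++); // the new bucket takes the lower half
  }
}
```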

@TheR1sing3un (Member Author)

> Also, the main gain of consistent hashing is that re-hashing tries to rewrite as few data files as possible. Otherwise you have to rewrite the whole existing data set (the entire table), a cost that is unacceptable in many cases.

So how can people who are using the simple bucket index today deal with the growing amount of data in their buckets? At present the only fix seems to be deleting and rebuilding the table, which is relatively costly. If we could support dynamically adjusting the number of buckets through clustering, per partition, would that be more appropriate?

@danny0405 (Contributor)

Another idea is to implement a light-weight index for each bucket, kind of a bit set of the record key hash codes (a rough sketch below).
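
A rough sketch of what such a per-bucket bit set might look like; this is only one reading of the suggestion, with hypothetical names, behaving like a single-hash Bloom filter:

```java
// One small bit set per bucket records the hash codes of the keys written to it.
// A lookup can then skip any bucket whose bit set says the key is definitely
// absent; false positives are possible, false negatives are not.
import java.util.BitSet;

public class BucketKeyBitSet {
  private final BitSet bits;
  private final int numBits;

  public BucketKeyBitSet(int numBits) {
    this.numBits = numBits;
    this.bits = new BitSet(numBits);
  }

  public void add(String recordKey) {
    bits.set(Math.floorMod(recordKey.hashCode(), numBits));
  }

  // false => definitely not in this bucket; true => may be present.
  public boolean mightContain(String recordKey) {
    return bits.get(Math.floorMod(recordKey.hashCode(), numBits));
  }
}
```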
