Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-8550] Make Hudi 1.x write timeline to a dedicated timeline folder under .hoodie #12288

Open
wants to merge 27 commits into
base: master
Choose a base branch
from

Conversation

bvaradar
Copy link
Contributor

Change Logs

Hudi 1.x to use dedicated timeline folder under .hoodie

Impact

Timeline folder to move to a dedicated folder.

Risk level (write none, low medium or high below)

none

Documentation Update

none

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Nov 18, 2024
if (needsUpgradeOrDowngrade(metaClient)) {
// unclear what instant to use, since upgrade does have a given instant.
executeUsingTxnManager(Option.empty(), () -> tryUpgrade(metaClient, Option.empty()));
updatedMetaClient = createMetaClient(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We better update the options within tryUpgrade, like what we already do: reloadTableConfig, reloadActiveTimeline.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metaClient is locally passed to this method which in turn needs to be passed to startCommit. On upgrade, metaclient is stale (wrong layoutVetsion) and needs to be updated. This is the reason for the above change.

Copy link
Contributor

@danny0405 danny0405 Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just refresh the timelineLayoutVersion and timelineLayout from the table config right after the table config are reloaded, so the refresh sequence become:

  1. refresh table config
  2. refresh timeline layout
  3. refresh timeline

Or we just remove the reload in tryUpgrade which is much simpler, and we need to take care all the calls of tryUpgrade.

@bvaradar
Copy link
Contributor Author

@danny0405 : Apart from the two comments, Are you ok with other changes ?

@danny0405
Copy link
Contributor

@danny0405 : Apart from the two comments, Are you ok with other changes ?

Yes, overall looks good to me, I reviewed EightToSevenDowngradeHandler in detail and it should be good, the other changes are minor besides that meta client refresh change.

@vinothchandar
Copy link
Member

@bvaradar Have not looked very closely. but can we also make sure the lsm timeline moves along with new folder.. Call it history instead of archived?

so

.hoodie 
   |______timeline <-- active lives here
                       |________ history <-- lsm lives here.

over 1.1 or 1.2 I want to fully converge it to a single timeline.

@codope codope force-pushed the HUDI-8550-new_timeline_folder branch from 5a9c25f to a90e9ea Compare November 22, 2024 06:38
@@ -711,7 +711,7 @@ class TestSpark3DDL extends HoodieSparkSqlTestBase {

test("Test schema auto evolution") {
withTempDir { tmp =>
Seq("COPY_ON_WRITE", "MERGE_ON_READ").foreach { tableType =>
Seq("COPY_ON_WRITE").foreach { tableType =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to revisit this after fixing all other tests. For MOR, this test fails due to

Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file file:/private/var/folders/s5/pqxf5ndx12qg6h0zgl2d9zxh0000gn/T/spark-83fab01d-7af8-4c22-9dee-8d840aa02e90/h1/americas/brazil/sao_paulo/c7c9ab23-56f7-45f4-bdbe-d7a8de9671bf-0_0-22-35_20241122094757341.parquet
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:264)
	at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:210)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetRowIndexUtil$RecordReaderWithRowIndexes.nextKeyValue(ParquetRowIndexUtil.scala:89)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator$$anon$1.hasNext(RecordReaderIterator.scala:61)
	at org.apache.hudi.util.CloseableInternalRowIterator.hasNext(CloseableInternalRowIterator.scala:50)
	at org.apache.hudi.common.table.read.HoodieKeyBasedFileGroupRecordBuffer.doHasNext(HoodieKeyBasedFileGroupRecordBuffer.java:134)
	at org.apache.hudi.common.table.read.HoodieBaseFileGroupRecordBuffer.hasNext(HoodieBaseFileGroupRecordBuffer.java:149)
	at org.apache.hudi.common.table.read.HoodieFileGroupReader.hasNext(HoodieFileGroupReader.java:235)
	at org.apache.hudi.common.table.read.HoodieFileGroupReader$HoodieFileGroupReaderIterator.hasNext(HoodieFileGroupReader.java:289)
	at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat$$anon$1.hasNext(HoodieFileGroupReaderBasedParquetFileFormat.scala:273)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:283)
	... 22 more
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableDouble
	at org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.setDouble(SpecificInternalRow.scala:284)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setDouble(ParquetRowConverter.scala:185)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addDouble(ParquetRowConverter.scala:96)
	at org.apache.parquet.column.impl.ColumnReaderBase$2$2.writeValue(ColumnReaderBase.java:269)
	at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:440)
	at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
	at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:234)
	... 34 more

@github-actions github-actions bot added size:XL PR with lines of changes > 1000 and removed size:L PR with lines of changes in (300, 1000] labels Nov 22, 2024
@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
migration size:XL PR with lines of changes > 1000 version-compatibility
Projects
Status: 🆕 New
Development

Successfully merging this pull request may close these issues.

6 participants