
Update FAQ to say that it is impossible to move a Delta Lake table in S3 #1293

Open
ABRB554 opened this issue Jul 27, 2022 · 3 comments
ABRB554 commented Jul 27, 2022

The FAQ has been updated with: "Remember to copy files without changing the timestamps to ensure that the time travel with timestamps will be consistent."

This needs to include a warning that, when the underlying storage is S3, it is impossible to move or create an object with the original/custom timestamp.

Or am I missing something? We are looking to move lots of data from HDFS to S3, but there is no way to preserve timestamps in this process, so we cannot move our HDFS Delta Lakes to S3. What is more important to the business users of the data: a slight performance hit, or a complete loss of time travel?

Original post:

The JSON file contains a timestamp of the commit: {"commitInfo":{"timestamp":1579786725976,

Why not use this rather than the modified time of the file?

The timestamp in commitInfo is created before we create the JSON file, so using commitInfo.timestamp could easily produce timestamps that are out of order with respect to versions. In addition, it is a client-side timestamp, which makes clock skew and incorrect clock times more likely. Moreover, if we had to read the contents of the JSON files to find a version by timestamp, we would need to open tons of JSON files. Currently we only need the file listing result, which is much faster.

Since we have updated the doc for this issue (https://docs.delta.io/latest/delta-faq.html#can-i-copy-my-delta-lake-table-to-another-location), I'm going to close this.

Originally posted by @zsxwing in #192 (comment)
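
To make the trade-off in the quoted answer concrete, here is a minimal sketch (not Delta's actual implementation) of the two lookup strategies, written against the Hadoop FileSystem API; the object name and helpers are invented for illustration:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object TimestampLookupSketch {
  // Fast path: a single directory listing yields a modification time
  // for every commit file in the _delta_log directory.
  def timestampsFromListing(logDir: Path): Map[Long, Long] = {
    val fs = logDir.getFileSystem(new Configuration())
    fs.listStatus(logDir)
      .filter(_.getPath.getName.endsWith(".json"))
      .map { s =>
        val version = s.getPath.getName.stripSuffix(".json").toLong
        version -> s.getModificationTime
      }.toMap
  }

  // Slow path: every commit file must be opened and parsed just to read
  // commitInfo.timestamp -- one read per version instead of one listing.
  def timestampsFromCommitInfo(logDir: Path): Map[Long, Long] = {
    val fs = logDir.getFileSystem(new Configuration())
    fs.listStatus(logDir)
      .filter(_.getPath.getName.endsWith(".json"))
      .map { s =>
        val in = fs.open(s.getPath)
        val json = try scala.io.Source.fromInputStream(in).mkString finally in.close()
        // Crude extraction for illustration; real code would use a JSON parser.
        val ts = """"timestamp"\s*:\s*(\d+)""".r
          .findFirstMatchIn(json).map(_.group(1).toLong).getOrElse(0L)
        s.getPath.getName.stripSuffix(".json").toLong -> ts
      }.toMap
  }
}
```

On an object store the slow path also multiplies API calls per lookup, which is where the "tons of JSON files" cost shows up.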

allisonport-db (Collaborator) commented:

I am unclear about what this issue is requesting. Are you asking to add a warning to the FAQ (https://docs.delta.io/latest/delta-faq.html#can-i-copy-my-delta-lake-table-to-another-location) that it is not possible to preserve timestamps when copying to S3?

"What is more important to the business users of the data: a slight performance hit, or a complete loss of time travel?"

Or is this a feature request to support time travel by timestamp for these copied tables (when timestamps are not preserved)? You are still able to time travel by version.

nkarpov (Collaborator) commented Jul 28, 2022

I realize this doesn't apply in your HDFS-to-S3 case @ABRB554, but as for the suggested change to the FAQ, I think it is possible to retain the original system metadata when moving objects within S3 using replication: https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html#replication-scenario

nkarpov self-assigned this Aug 3, 2022
baolsen commented Jul 27, 2023

Expanding a bit on this issue. My use case is the same as @ABRB554.

Consider the following source code:
https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/DeltaHistoryManager.scala#L88-L90

It appears the current Delta functionality is:
1 - For each version, read the log JSON action into a CommitInfo object (*see the exception noted below).
2 - Override the CommitInfo.timestamp that was stored in the log with the modification date of the log file.

(I'm no expert, so someone please double-check my understanding. I have no idea at this point how the file modification date works for compacted files, which might change the discussion below....)
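
For reference, a simplified paraphrase of the behavior described in steps 1 and 2 (the CommitInfo case class below is a stand-in with an illustrative subset of fields, not Delta's real action class):

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Stand-in for Delta's CommitInfo action (illustrative fields only).
case class CommitInfo(version: Option[Long], timestamp: Long)

def commitInfoWithFileTimestamp(fs: FileSystem, commitFile: Path,
                                parsed: CommitInfo): CommitInfo = {
  // Step 1 happened upstream: the JSON action was parsed into `parsed`.
  // Step 2: discard the stored timestamp in favor of the log file's
  // modification time.
  val status: FileStatus = fs.getFileStatus(commitFile)
  parsed.copy(timestamp = status.getModificationTime)
}
```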

Regarding proposed changes to this functionality, these are the concerns I gather from this issue and the previously linked issue:

Performance concern of reading the JSON file compared to listing the file status.

I believe this concern is misplaced because, as seen in the code, the JSON file is already being read into the CommitInfo object, and only then is its timestamp overridden by the physical file status's modification date.

Impact of changing this core part of Delta, which has been around since v0.1.0 (the initial release).

I believe this concern can be avoided by making any proposed change optional: toggle it via configuration, and keep the current behavior as the default.
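
As a sketch of what such a toggle could look like (the flag name is invented here and does not exist in Delta; CommitInfo is the stand-in from the earlier sketch):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

def commitTimestamp(spark: SparkSession, fs: FileSystem,
                    commitFile: Path, parsed: CommitInfo): Long = {
  val optIn = spark.conf
    .getOption("spark.delta.timeTravel.useCommitInfoTimestamp") // invented flag name
    .exists(_.toBoolean) // absent => false, i.e. current behavior by default
  if (optIn) parsed.timestamp // opt-in: trust the stored commit timestamp
  else fs.getFileStatus(commitFile).getModificationTime // default: today's behavior
}
```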

Amazon S3 is a popular storage layer, and yet S3 gives users no way to specify the modification date of an object.

Apart from S3-to-S3 replication, there is no way to do this, and replication only copies the timestamp of an existing S3 object, which itself cannot be modified. This means any user who moves a table from any storage to S3 and needs time travel by timestamp to work consistently will have a problem. Presumably the problem is wider than S3: any other storage would, at a minimum, require the user to update the file modification date of every transaction log file, which requires custom scripting outside of Delta Lake's table support. Ideally, a Delta Lake table should have sufficient integrity that all metadata required for it to function is independent of the file system implementation.

Concern over clock skew and differences between client clocks for the commit timestamp.

If we agree that using a commit timestamp instead of a file modification date should be an optional, opt-in feature, then I would suggest that the documentation for enabling the feature inform the user of this limitation; that should be sufficient. In many cases the client clock is under some kind of organisation-wide control and a single client updates the table each day, so this is somewhat of an edge-case concern.

Concern over the difference between when the commit timestamp is created and when the transaction log file is last modified.

A scenario that comes to mind: a commit is started at time t0, data is written until time t1, and then Delta actually commits the transaction at time t2. In this case t0 (the commit timestamp) and t2 (the file modification date) may differ significantly. A counter-argument, however: what then is the value or meaning of the commit timestamp? In a traditional database we might say a transaction starts at time t0 and changes are made, but they are only committed at time t2. It is misleading to think of the Delta commit timestamp in the transaction log as the physical commit timestamp.

Proposal

Considering these concerns, below is a proposed way to address all of them.

1 - Make the below changes in behavior an optional configuration, off by default.

2 - On compaction of transaction log files, record the "real" / effective commit timestamp against the commit. That is, at the time of compaction, read the log file's modification date and write it into the commit info in the compaction file. Possibly use a new field added to CommitInfo if there is concern over changing the existing field; otherwise, update the existing one. Regardless, let's refer to this correct value as the effectiveTimestamp in the discussion below. It is equivalent to the file modification date, but stored permanently in the log metadata.

3 - Only override the commit timestamp on the CommitInfo object with the file system timestamp when the file is not compacted. This avoids the client clock issues above.

The above approach might also improve performance, since only non-compacted files would need a file system modification-date check, whereas the current implementation always checks.
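
A sketch of the resulting lookup order (the CommitEntry type and its fields are assumptions of this proposal, not existing Delta API):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

case class CommitEntry(version: Long,
                       timestamp: Long,                  // client-side commit time
                       effectiveTimestamp: Option[Long]) // written at compaction

def resolveTimestamp(fs: FileSystem, logFile: Path, entry: CommitEntry): Long =
  entry.effectiveTimestamp.getOrElse(
    // Uncompacted entry: fall back to the file system modification date,
    // matching today's behavior.
    fs.getFileStatus(logFile).getModificationTime)
```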

Limitations

Uncompacted files would not have the effectiveTimestamp field.

This means any user-specific tool that moves a Delta table from one storage to another would need to either force a compaction first (difficult, as these APIs are not exposed) or, more simply, modify the JSON transaction files to add the effectiveTimestamp to them; a sketch of this follows. Delta could automatically use the value when it is present for uncompacted files, a small amendment to the above implementation. From what I understand, we already read the JSON file before currently overriding the timestamp.
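
A rough sketch of what such a pre-copy stamping tool could do, under this proposal's assumptions (the effectiveTimestamp field does not exist in Delta today, and the string surgery is for illustration only):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Before copying the table, embed each log file's current modification date
// into its commitInfo, so the timestamp survives a copy that resets mtimes.
def stampEffectiveTimestamps(logDir: Path): Unit = {
  val fs = logDir.getFileSystem(new Configuration())
  for (status <- fs.listStatus(logDir)
       if status.getPath.getName.endsWith(".json")) {
    val in = fs.open(status.getPath)
    val json = try scala.io.Source.fromInputStream(in).mkString finally in.close()
    // Crude string surgery for illustration; a real tool would parse and
    // rewrite the commitInfo action with a JSON library.
    val stamped = json.replaceFirst(
      """\{"commitInfo":\{""",
      s"""{"commitInfo":{"effectiveTimestamp":${status.getModificationTime},""")
    // Rewriting changes the file's own mtime, but that no longer matters:
    // the effective timestamp is now stored in the log content itself.
    val out = fs.create(status.getPath, true)
    try out.write(stamped.getBytes("UTF-8")) finally out.close()
  }
}
```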

I can't think of other limitations....

Please @allisonport-db / @zsxwing, would you give feedback on the proposed approach?

"*" If no CommitInfo was found, an empty one is created in DeltaHistoryManager.getCommitInfo. This won't have a commit timestamp.
