Added proposal for Read-Write coordination-free bucket operation contract #700
Conversation
That helps with operations between the sidecar and compactor, but does not fix inter-compactor logic in the case of eventually consistent storage or partial upload failures.

We also partially assumed strong consistency for object storage, because [GCS has strong consistency](https://cloud.google.com/storage/docs/consistency) and S3 has it as well, but with [some caveats](https://codeburst.io/quick-explanation-of-the-s3-consistency-model-6c9f325e3f82) that we have run into.
So, does it have strong consistency or not? Or more importantly, do you want to say that effort has been made such that our use of S3 is within its strongly consistent guarantees?
> So, does it have strong consistency or not?

Depends how you use it (:

> Or more importantly, do you want to say that effort has been made such that our use of S3 is within its strongly consistent guarantees?

Yes, indeed. There was an effort to eliminate `Get` or `Exists` calls before writing an object in crucial parts, but we did NOT eliminate them everywhere; that's why some `syncDelay` exists, even if only for part of new blocks.
* What if one block is malformed? Readers cannot handle it and crash. How will the repair procedure work? We can have a repair process that downloads the block locally, rewrites and fixes it, and uploads it again. The problem is that the repaired block will take `syncDelay` time to appear in the system. Since we are blocked, we need to make the block available immediately, ignoring eventual consistency limitations.
  Potential solutions:
  * Just delete the problematic block and live with 15 minutes of delay for not having some portion of metrics? Will it not cause any downsampling issues?
If this would cause alerts to go off that should not, I would refrain from just pulling the data...
What do you mean? (:
This was rejected because we can reuse the existing meta.json as an "integrity" marker (e.g. if after 15m we don't have the meta file, the block is partially uploaded and should be removed).

Additional information, like the expected number of files, can be added later if needed; however, with rule no. 3 we should not need that.
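For illustration, here is a minimal sketch of such an integrity check, assuming a hypothetical `Block` view of object storage contents (the struct, helper, and `syncDelay` constant are illustrative, not Thanos APIs):

```go
package contract

import (
	"time"

	"github.com/oklog/ulid"
)

// Block is a hypothetical view of a block found in object storage.
type Block struct {
	ID          ulid.ULID
	HasMetaJSON bool // was <block-ulid>/meta.json present and fully loadable?
}

const syncDelay = 15 * time.Minute

// isPartialUpload treats a block with no loadable meta.json that is older
// than syncDelay as a partial upload that is safe to remove. A younger block
// without meta.json may simply still be uploading, so it is left alone.
func isPartialUpload(b Block) bool {
	if b.HasMetaJSON {
		return false // meta.json present: upload completed.
	}
	age := time.Duration(ulid.Now()-b.ID.Time()) * time.Millisecond
	return age > syncDelay
}
```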
I think rule 3 (meta file as the first to be uploaded) may have to be refactored to talk about some content of that file that constitutes a checksum.
Responded above; I think a checksum is not helpful here.
Force-pushed from fc41b35 to e12b17f.
We may avoid introducing additional state by adding a mitigation for not having delayed removal.

For example, for retention apply or manual block deletion: if we deleted a block immediately, we could see query failures (the object does not exist for GetRange operations).
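As a hedged sketch (the `toDelete` bookkeeping and the `deleteDelay` value are assumptions for illustration, not the actual Thanos implementation), delayed removal could look like this:

```go
package contract

import (
	"time"

	"github.com/oklog/ulid"
)

// toDelete records when a block was marked for deletion instead of
// deleting it immediately.
type toDelete struct {
	id       ulid.ULID
	markedAt time.Time
}

const deleteDelay = 30 * time.Minute

// blocksSafeToDelete returns blocks whose grace period has passed, so readers
// that discovered them before the mark have had time to finish in-flight
// GetRange operations against their files.
func blocksSafeToDelete(queue []toDelete, now time.Time) []ulid.ULID {
	var ready []ulid.ULID
	for _, td := range queue {
		if now.Sub(td.markedAt) > deleteDelay {
			ready = append(ready, td.id)
		}
	}
	return ready
}
```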
Is `retention apply` a known operation? This doesn't quite read right...
Yes, and the method is actually called like this. We can reword this... How would you name it differently?
Force-pushed from e12b17f to d775946.
Looks good, small nits.
1. A block is assumed ready to be read *only* after 15 minutes (configured by `syncDelay`) from its *creation time*. The block id (`ULID`) is used to determine creation time (`if ulid.Now()-id.Time() < uint64(c.syncDelay/time.Millisecond)`). A block fresher than 15 minutes is assumed to be still in upload (`Fresh`); see the sketch after this list. This can be overridden by the `ignore_delay` field in meta.json.
1. The TSDB `meta.json` file within a block *has to be* uploaded as the first one.
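Here is a compilable sketch of that freshness check, built around the quoted condition; the standalone function and its parameters are illustrative rather than the actual Thanos code:

```go
package contract

import (
	"time"

	"github.com/oklog/ulid"
)

// isFresh implements rule 1: a block whose ULID was generated less than
// syncDelay ago is assumed to still be uploading and must not be read yet.
// Both ulid.Now() and id.Time() are Unix timestamps in milliseconds.
func isFresh(id ulid.ULID, syncDelay time.Duration) bool {
	return ulid.Now()-id.Time() < uint64(syncDelay/time.Millisecond)
}
```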
Should this not be: "Ensure the `meta.json` file is uploaded as the last file in the block folder."
The major difference here is the compactor. Currently the compactor does not allow overlaps: it immediately halts and waits for manual action. This is on purpose, to prevent block malformation from blocks which should not be compacted together.

"The same data" overlap is crucial here. An overlap with the same data (a valid overlap) is when:
"Overlaps with the same data" is crucial here as valid overlaps are when:"
*The newer block of two or more overlapping blocks fully submatches the source blocks of the older blocks. The older blocks can then be ignored.*

The word *fully* is crucial. For example, we won't be able to resolve the case of blocks ABCD and CDEF. This is because there is no logic for decompaction or vertical compaction.
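A small sketch of what that "fully submatches" check could look like (function and parameter names are hypothetical, not the Thanos implementation):

```go
package contract

import "github.com/oklog/ulid"

// sourcesSubsumed reports whether every source block of the older block is
// also a source of the newer block, i.e. the newer block *fully* covers the
// older one and the older block can safely be ignored (and later deleted).
func sourcesSubsumed(newerSources, olderSources []ulid.ULID) bool {
	set := make(map[ulid.ULID]struct{}, len(newerSources))
	for _, s := range newerSources {
		set[s] = struct{}{}
	}
	for _, s := range olderSources {
		if _, ok := set[s]; !ok {
			// E.g. blocks ABCD vs CDEF: sources A and B are missing from the
			// newer block, so the overlap cannot be resolved automatically.
			return false
		}
	}
	return true
}
```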
"The word fully is crucial"
This also means that repairing a block gracefully (rewriting it to delete series, or for another reason that does not need to be unblocked) is as easy as uploading a new block. The compactor will eventually delete the older overlapped block.

There is a caveat in the 2nd rule (block being ready) for the compactor. In the compaction process we compact and want to be aware of this block later on. Because of…
2nd
omg 🤦‍♂️
## Risks

* What if someone sets `syncDelay` too low, and the upload itself takes more than `syncDelay` time? The block will be assumed to be in the `ToDelete` state and will be removed. Thanks to `deleteDelay` it might still have some time to recover. How do we ensure that this will not happen?
  * A minimum `syncDelay`?
👍
## Action Items:

* Ensure our block operations follow those rules (e.g. create methods that ensure this).
fix formatting
## Action Items:

* Ensure our block operations follow those rules (e.g. create methods that ensure this).
* Test edge cases.
Can we be explicit about which edge cases?
Good point - moved to TBD, as we would like to define a more specific action plan IMO.
BTW, the new TSDB code includes `Parents` as well! However, it is only a single parent, so it is useful only for a rewrite of a block, not actually for compaction.
With those rules one could ask: "why wait 15 minutes if we can just check whether meta.json is parsable?". In an eventually consistent world we cannot just check meta.json: even when it was uploaded at the end, the other files might still be unavailable to a reader.

Thanks to meta.json being uploaded at the end, we can form the 4th rule:
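To make the ordering concrete, here is a hedged sketch of an uploader honoring the contract; the `Bucket` interface, file list, and helper are simplified stand-ins, not the real Thanos objstore API:

```go
package contract

import (
	"context"
	"os"
	"path"
	"path/filepath"
)

// Bucket is a minimal stand-in for an object storage client.
type Bucket interface {
	Upload(ctx context.Context, name string, data []byte) error
}

// uploadBlock sketches the upload order the contract demands: all data files
// first, meta.json strictly last. A visible meta.json then implies a complete
// block (after syncDelay, which absorbs eventual consistency of other files).
func uploadBlock(ctx context.Context, bkt Bucket, localDir, blockID string) error {
	for _, f := range []string{"chunks/000001", "index"} {
		data, err := os.ReadFile(filepath.Join(localDir, filepath.FromSlash(f)))
		if err != nil {
			return err
		}
		if err := bkt.Upload(ctx, path.Join(blockID, f), data); err != nil {
			// Abort: meta.json was never written, so readers will treat the
			// leftovers as a partial upload that can be cleaned up later.
			return err
		}
	}
	// Only now publish the block by uploading meta.json.
	data, err := os.ReadFile(filepath.Join(localDir, "meta.json"))
	if err != nil {
		return err
	}
	return bkt.Upload(ctx, path.Join(blockID, "meta.json"), data)
}
```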
Maybe I just don't understand it, but:
- in l.93 you stated meta.json is uploaded first
- in l.101 you stated meta.json is uploaded at the end
It should indeed be the last thing; it was a typo earlier in the doc. I think it will be corrected soon 👍
Yea, it is a typo, sorry ;p meta.json should be uploaded last.
to determine creation time (`if ulid.Now()-id.Time() < uint64(c.syncDelay/time.Millisecond)`). A block fresher than 15 minutes is assumed to be still in upload (`Fresh`). This can be overridden by the `ignore_delay` field in meta.json.
1. The TSDB `meta.json` file within a block *has to be* uploaded as the first one.
1. After 15 minutes (or the time configured by `syncDelay`), if there is no fully loadable `meta.json`, we assume the block is malformed, partially uploaded or marked for deletion and can be…
Maybe I'd be careful about automatic deletion of those malformed blocks? For example, from what I have read in issues, TSDB itself tries to avoid any deletion of data, even malformed data. Or maybe this could be configurable: whether the user wants to leave the block in place, ignored by Thanos, or to delete it?
We are only deleting this from blob storage; the sidecar / ruler will then notice it has a local block that has not been successfully uploaded and will re-upload it. Therefore, deleting from block storage allows us to re-upload a block.
But the sidecar/ruler can meanwhile delete this block, and then you have nothing to re-upload - or can this not happen?
I think this would only happen if you have a really aggressive clean-up policy in your Prometheus configuration... Based on the default sync / delete timings it would attempt to re-upload in about 30 mins.
Well, we are serious about this point as well. But without meta.json this block is kind of useless, as it does not have external labels.
This case happens a lot for the compactor. If it is uploading and crashes in the middle, it ends up with a weird partial block. Since the upload operation failed, nothing got removed, and the compactor will just retry. So it is completely safe to remove such a partial block.
The other writer is the sidecar/ruler. In this case we need to ensure the sidecar and ruler retry as well if an upload fails. This is done by writing to a persistent file once the upload is successful; if that record is absent, the `shipper` package will retry until success.
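For illustration, a minimal sketch of such a persistent "uploaded" marker file (the field names and JSON format are assumptions, not the actual shipper file layout):

```go
package contract

import (
	"encoding/json"
	"os"
)

// uploadedState is an illustrative version of the small state file a
// shipper-style uploader persists between sync loops.
type uploadedState struct {
	Uploaded []string `json:"uploaded"` // ULIDs of fully uploaded blocks.
}

// markUploaded records id only after a successful upload. Blocks missing
// from this file are considered not-yet-shipped and are retried on the
// next sync loop.
func markUploaded(path, id string) error {
	var s uploadedState
	if b, err := os.ReadFile(path); err == nil {
		_ = json.Unmarshal(b, &s) // best effort: corrupt state just means re-upload.
	}
	s.Uploaded = append(s.Uploaded, id)
	b, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, b, 0o644)
}
```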
Force-pushed from d775946 to 0015525.
Addressed comments, Action Items TBD. I feel like…
Minor nits, but I think this is good.
@@ -0,0 +1,251 @@
# Read-Write coordination free operational contract for object storage

Status: draft | **in-review** | rejected | accepted | complete
**accepted**
so the repair procedure might be avoided. Turning off the compactor might be tricky as well in a potential HA mode. Maybe we want to introduce "locking" some external labels from compactor operations. If yes, how do we do that in an eventually consistent system?

*
remove line
…ract. Signed-off-by: Bartek Plotka <[email protected]>
Force-pushed from 0015525 to efc27f5.
Added tasks, moved to approved dir and added date.
Main goals are:
…`syncDelay` and handle eventual consistency & partial upload differently. #377

CC @improbable-ludwik