
Added proposal for Read-Write coordination-free bucket operation contract #700

Merged: 1 commit into master on Jan 3, 2019

Conversation

@bwplotka requested a review from domgreen on December 31, 2018 15:23
That helps with operations between the sidecar and compactor, but it does not fix inter-compactor logic in the case of eventually consistent storage or partial upload failures.

We also partially assumed strong consistency for object storage, because [GCS has strong consistency](https://cloud.google.com/storage/docs/consistency) and S3 has it as well,
but with [some caveats](https://codeburst.io/quick-explanation-of-the-s3-consistency-model-6c9f325e3f82) that we have run into.
Contributor
So, does it have strong consistency or not? Or, more importantly, do you want to say that effort has been made such that our use of S3 is within its strongly consistent guarantees?

Member Author (@bwplotka, Dec 31, 2018)

> So, does it have strong consistency or not?

Depends how you use it (:

> Or, more importantly, do you want to say that effort has been made such that our use of S3 is within its strongly consistent guarantees?

Yes, indeed. There was an effort to eliminate Get or Exists calls before writing objects in the crucial parts, but we did NOT eliminate them everywhere; that's why some `syncDelay` exists for only part of the new blocks.
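For illustration, a minimal sketch of the `syncDelay` freshness check the proposal describes, assuming the `github.com/oklog/ulid` package that the TSDB snippet later in this thread also uses; the `isFresh` helper and the standalone program are made up for the example, not actual Thanos code:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"

	"github.com/oklog/ulid"
)

// isFresh sketches the syncDelay rule: a block whose ULID timestamp is younger
// than syncDelay is assumed to still be in upload and must not be read yet.
func isFresh(id ulid.ULID, syncDelay time.Duration) bool {
	return ulid.Now()-id.Time() < uint64(syncDelay/time.Millisecond)
}

func main() {
	entropy := rand.New(rand.NewSource(time.Now().UnixNano()))
	id := ulid.MustNew(ulid.Now(), entropy) // a block id minted right now

	// With the default 15m syncDelay, a just-created block is still "Fresh".
	fmt.Println(isFresh(id, 15*time.Minute)) // true
}
```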

* What if one block is malformed? Readers cannot handle it and crash. How will the repair procedure work? We can have a repair process that downloads the block locally, rewrites and fixes it, and uploads it again. The problem is that it will take `syncDelay` time to
appear in the system. Since we are blocked, we need to make the block available immediately, ignoring eventual consistency limitations.
Potential solutions:
  * Just delete the problematic block and live with 15 minutes of delay for not having some portion of metrics? Will it not cause any downsampling issues?
Contributor

If this would cause alerts to go off that should not go off, I would refrain from just pulling the data...

Member Author

What do you mean? (:


This was rejected because we can reuse the existing meta.json as an "integrity" marker (e.g. if after 15m we don't have the meta file, the block is partially uploaded and should be removed).

Additional information like the expected number of files can be added later if needed; however, with rule no. 3 we should not need that.
Contributor

I think rule 3 (meta file as the first to be uploaded) may have to be refactored to talk about some content of that file that constitutes a checksum.

Member Author

Responded above; I think a checksum is not helpful here.

@bwplotka force-pushed the proposal-compactor branch 2 times, most recently from fc41b35 to e12b17f on December 31, 2018 16:34

We may avoid introducing additional state by adding a mitigation for not having delayed removal.

For example, for retention apply or manual block deletion, if we delete the block immediately we can have query failures (the object does not exist for GetRange operations).
Contributor (@improbable-ludwik, Dec 31, 2018)

Is "retention apply" a known operation? This doesn't quite read right...

Member Author

Yes, and the method is actually called like this. We can reword it... How would you phrase it differently?

Contributor @domgreen left a comment

Looks good, small nits.

docs/proposals/read-write-operations-bucket.md (outdated)
1. A block is assumed ready to be read *only* after 15 minutes (configured by `syncDelay`) from its *creation time*. The block id (`ULID`) is used
to determine the creation time (`if ulid.Now()-id.Time() < uint64(c.syncDelay/time.Millisecond)`). A block fresher than 15 minutes is assumed to be
still in upload (`Fresh`). This can be overridden by the `ignore_delay` field in meta.json.
1. The TSDB `meta.json` file within a block *has to be* uploaded as the first one.
Contributor

Should this not be:
"Ensure meta.json file is uploaded as the last file in the block folder."

docs/proposals/read-write-operations-bucket.md (outdated)
The major difference here is the compactor. Currently the compactor does not allow overlaps. It immediately halts and waits for
manual action. This is on purpose, to not allow block malformation from blocks which should not be compacted together.

"The same data" overlap is crucial here. An overlap with the same data (a valid overlap) is when:
Contributor

"Overlaps with the same data" is crucial here as valid overlaps are when:"


*The newer block from two or more overlapping blocks fully submatches the source blocks of the older blocks. The older blocks can then be ignored.*

Fully word is crucial. For example, we won't be able to resolve the case with blocks ABCD and CDEF. This is because there is no logic for decompaction or vertical compaction.
Contributor

"The word fully is crucial"


This also means that repairing a block gracefully (rewriting it to delete series, or for another reason that does not need immediate unblocking) is as easy as uploading a new block. The compactor will eventually delete the older overlapped block.
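To make the "fully submatches" rule above concrete, here is a hedged sketch of how such a check could look; the package, function name and standalone shape are assumptions for the example, not the actual Thanos compactor code:

```go
package compact

import "github.com/oklog/ulid"

// isObsoleteOverlap sketches the "fully submatches" rule: an older overlapping
// block can be ignored (and eventually deleted) only if every one of its source
// ULIDs is already contained in the newer block's sources. Partial overlaps,
// such as ABCD vs CDEF, do not qualify and still require manual intervention.
func isObsoleteOverlap(newerSources, olderSources []ulid.ULID) bool {
	set := make(map[ulid.ULID]struct{}, len(newerSources))
	for _, s := range newerSources {
		set[s] = struct{}{}
	}
	for _, s := range olderSources {
		if _, ok := set[s]; !ok {
			return false // the older block carries data the newer one does not cover
		}
	}
	return true
}
```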

There is a caveat for the 2th rule (block being ready) for the compactor. In the case of the compaction process, we compact and we want to be aware of this block later on. Because of
Contributor

2nd

Member Author

omg 🤦‍♂️

## Risks

* What if someone sets `syncDelay` too low and the upload itself takes more than `syncDelay` time? The block will be assumed to be in the `ToDelete` state and will be removed. Thanks to `deleteDelay` it might still have some time to recover. How do we ensure that this will not happen?
  * Minimum `syncDelay`?
Contributor

👍


## Action Items:

*Ensure our block operations follow those rules (e.g. create methods that ensure this).
Contributor

fix formatting

## Action Items:

*Ensure our block operations follow those rules (e.g. create methods that ensure this).
*Test edge cases.
Contributor

Can we be explicit about which edge cases?

Member Author

Good point - moved to TBD, as we would like to define a more specific action plan IMO.

Member Author @bwplotka commented on Jan 2, 2019

BTW, new TSDB code includes Parents as well!

func (c *LeveledCompactor) Write(dest string, b BlockReader, mint, maxt int64, parent *BlockMeta) (ulid.ULID, error) {
	entropy := rand.New(rand.NewSource(time.Now().UnixNano()))
	uid := ulid.MustNew(ulid.Now(), entropy)

	meta := &BlockMeta{
		ULID:    uid,
		MinTime: mint,
		MaxTime: maxt,
	}
	meta.Compaction.Level = 1
	meta.Compaction.Sources = []ulid.ULID{uid}

	if parent != nil {
		meta.Compaction.Parents = []BlockDesc{
			{ULID: parent.ULID, MinTime: parent.MinTime, MaxTime: parent.MaxTime},
		}
	}

	err := c.write(dest, meta, b)
	if err != nil {
		return uid, err
	}

	level.Info(c.logger).Log("msg", "write block", "mint", meta.MinTime, "maxt", meta.MaxTime, "ulid", meta.ULID)
	return uid, nil
}

However, it takes only one parent, so it is useful only for a rewrite of a block, not actually for compaction.

With those rules, one could say "why wait 15 minutes if we can just check whether meta.json is parsable?". In an eventually consistent world
we cannot just check meta.json, as even when it was uploaded at the end, other files might still be unavailable to a reader.

Thanks to meta.json being uploaded at the end, we can form the 4th rule:
Contributor

Maybe I just don't understand it, but:

  • in l.93 you stated meta.json is uploaded first
  • in l.101 you state meta.json is uploaded at the end

Contributor

It should indeed be the last thing; that was a typo earlier in the doc, I think it will be corrected soon 👍

Member Author

Yea, it is a typo, sorry ;p meta.json should be uploaded last.
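As a hedged sketch of that ordering, assuming a simplified stand-in `Bucket` interface and flat object layout (not the actual Thanos objstore API), the idea is to push every block file first and only then meta.json, so a readable meta.json implies the rest of the block is already in place:

```go
package block

import (
	"context"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// Bucket is a stand-in for an object storage client, not the real objstore API.
type Bucket interface {
	Upload(ctx context.Context, name string, r io.Reader) error
}

// uploadBlock uploads every file of a block first and meta.json strictly last.
func uploadBlock(ctx context.Context, bkt Bucket, blockID string, files []string) error {
	var meta string
	for _, f := range files {
		if filepath.Base(f) == "meta.json" {
			meta = f // defer the integrity marker until everything else is written
			continue
		}
		if err := uploadFile(ctx, bkt, blockID, f); err != nil {
			return err
		}
	}
	if meta == "" {
		return fmt.Errorf("block %s has no meta.json", blockID)
	}
	return uploadFile(ctx, bkt, blockID, meta) // meta.json goes last
}

func uploadFile(ctx context.Context, bkt Bucket, blockID, path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// Objects are named "<block ULID>/<file name>" in this simplified sketch.
	return bkt.Upload(ctx, blockID+"/"+filepath.Base(path), f)
}
```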

docs/proposals/read-write-operations-bucket.md (outdated)
to determine the creation time (`if ulid.Now()-id.Time() < uint64(c.syncDelay/time.Millisecond)`). A block fresher than 15 minutes is assumed to be
still in upload (`Fresh`). This can be overridden by the `ignore_delay` field in meta.json.
1. The TSDB `meta.json` file within a block *has to be* uploaded as the first one.
1. After 15 minutes (or the time configured by `syncDelay`), if there is no fully loadable `meta.json`, we assume the block is malformed, partially uploaded or marked for deletion and can be
Member

Maybe I'd be careful about automatic deletion of those malformed blocks?
For example, TSDB itself, from what I have read in its issues, tries to avoid any deletion of data, even malformed data.

Or maybe this could be configurable, if the user wants to leave such blocks in place (ignored by Thanos) or to delete them?

Contributor

We are only deleting this from blob storage; the sidecar / ruler will then notice it has a local block that has not been successfully uploaded and re-upload it. Therefore, deleting from blob storage allows us to re-upload a block.

Member

But the sidecar/ruler can meanwhile delete this block and then you have nothing to re-upload, or can this not happen?

Contributor

I think this would only happen if you have a really aggressive clean-up policy in your Prometheus configuration... based on the default sync / delete timings it would attempt to re-upload in about 30 minutes.

Member Author

Well, we are serious about this point as well. But without meta.json this block is kind of useless, as it does not have external labels.

This case happens a lot for the compactor. If it is uploading and crashes in the middle, it ends up with some weird partial block. Since the upload operation failed, nothing got removed, and the compactor will just retry. So it is completely safe to remove such a partial block.

The other writer is the sidecar/ruler. In this case we need to ensure the sidecar and ruler will retry as well in case the upload fails. This is done by writing to a persistent file once the upload is successful. If that did not happen, the shipper package will retry until success.
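A hedged sketch of that cleanup decision, tying the `syncDelay` rule to the missing meta.json; the function and the `metaLoadable` callback are hypothetical, not the actual compactor code:

```go
package compact

import (
	"time"

	"github.com/oklog/ulid"
)

// isPartialBlock treats a block as partially uploaded (or malformed) once its
// ULID is older than syncDelay and its meta.json still cannot be loaded. Such a
// block is safe to delete, because the writer that produced it (compactor,
// sidecar or ruler) still has the data and will retry the whole upload.
// metaLoadable is a hypothetical callback that fetches and parses meta.json.
func isPartialBlock(id ulid.ULID, syncDelay time.Duration, metaLoadable func(ulid.ULID) bool) bool {
	age := time.Duration(ulid.Now()-id.Time()) * time.Millisecond
	if age < syncDelay {
		return false // still Fresh: the upload may simply not have finished yet
	}
	return !metaLoadable(id)
}
```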

@bwplotka force-pushed the proposal-compactor branch from d775946 to 0015525 on January 3, 2019 13:20
Member Author @bwplotka commented on Jan 3, 2019

Addressed comments, Action Items TBD.

I feel like `deleteSync` is crucial here. See the first risk in the #Risks section.

Contributor @domgreen left a comment

Minor nits, but I think this is good.

@@ -0,0 +1,251 @@
# Read-Write coordination-free operational contract for object storage

Status: draft | **in-review** | rejected | accepted | complete
Contributor

**accepted**

so the repair procedure might be avoided. Turning off the compactor might be tricky as well in a potential HA mode. Maybe we want to introduce
"locking" some external labels from compactor operations. If yes, how do we do it in an eventually consistent system?

*
Contributor

remove line

@bwplotka force-pushed the proposal-compactor branch from 0015525 to efc27f5 on January 3, 2019 16:25
Member Author @bwplotka commented on Jan 3, 2019

Added tasks, moved to approved dir and added date.

@bwplotka merged commit 7e4f0da into master on Jan 3, 2019
@bwplotka deleted the proposal-compactor branch on January 3, 2019 17:12