
TTL support for streams #23

Closed
DaveAurionix opened this issue Oct 26, 2017 · 18 comments

@DaveAurionix

We have a need to delete streams after a time period that is agreed with the business. The driver is to cap CosmosDB costs by reducing the storage capacity requirement. As a result, some less-used services can change to single-partition collections, which significantly reduces their contribution to the total RU cost. Is it worth a discussion about a proposed approach, following which we could put a PR together?

@asosMikeGore

I think this would be good to have in. Just want to check a few things:

  • Do you need the deletions to show up in projections? Deletes don't show up in the change feed, so just wanted to check if there's a need to add a soft delete along with a hard delete.
  • Do you need to set the TTL at stream creation or a later date? Thinking allowing both would be handy.
  • Do you derive no business value from these streams? One thing to bear in mind is that if the deletion is significant from a business perspective, you'd lose the capability to project on those deletions with a full delete.
  • Would adding in a compression capability help with this/be an alternative solution? As the event body & metadata are not indexed, we could add this in pretty easily.

Happy to accept a PR if you have something ready?
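For what it's worth, the compression idea could be sketched like this. This is a minimal Python illustration of the principle rather than anything from SES itself (the helper names are made up): gzip the serialized event body before writing, and reverse it on read, which is viable precisely because the body and metadata are not indexed.

```python
import gzip
import json

def compress_body(event_body: dict) -> bytes:
    """Gzip a JSON event body before it is written to the store."""
    return gzip.compress(json.dumps(event_body).encode("utf-8"))

def decompress_body(blob: bytes) -> dict:
    """Reverse the compression when the event is read back."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))

body = {"type": "OrderPlaced", "orderId": "1234", "note": "x" * 500}
blob = compress_body(body)
assert decompress_body(blob) == body          # round-trips losslessly
assert len(blob) < len(json.dumps(body))      # repetitive payloads shrink well
```

The trade-off is CPU on read/write in exchange for storage, which is why it only delays the growth problem rather than capping it.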

@DaveAurionix
Author

  1. We have no current requirement for a soft delete that shows up in projections. The whole stream silently vanishing would be fine.
  2. Good question. I was picturing setting the TTL at stream creation but I wonder if we'd eventually wish for a feature that allows a stream to be deleted X days past the last event written to it (i.e. a sliding TTL). We have no current requirement for that latter idea.
  3. The business can't currently understand/liberate the value present in those streams. It's more that the flavour of ice-cream preferred this month is that of reduced Azure hosting costs. As we continue to grow, the question of releasing better business intelligence from the event streams will come up, at which point the value is understood and that in turn releases more budget for the storage and we would stop deleting some. We won't be asked why we didn't keep event streams from before a reporting requirement was known, but we would be asked why we continued to pay for storage of data that we're not using right now. Small steps :). If the event resulting in a deletion itself had business value, I think we're more likely to think about a deferred message approach and not rely on storage-layer TTLs (e.g. if something "expires" we're likely to use a deferred message to raise an event that we react to, possibly not deleting the stream at that precise point). It's worth adding that we're not planning to use TTL on key services, only on some smaller less relevant services geared around short-lived interactions.
  4. Wouldn't object to compression but TTL has more value for us at the moment as it is a fixed/determinable cap, whereas compression simply delays the inevitable. Would help, but not a priority for us at the moment.

I don't have a PR; I thought it best to discuss an approach and requirements first (there are useful questions here). (And my better half just got back home, so I don't have a starter-for-ten idea!)

@asosMikeGore

Sounds like the immediate requirement then is support for a hard delete. To be fair, a soft delete could be represented as an event anyway. As we don't currently have the concept of a stream-specific metadata document that could record whether the stream as a whole is deleted, an event is likely the simplest way to handle it if anyone wants that.

Compression would be a good cost-saving measure: Cosmos requires a set amount of RUs per partition, so shrinking the data to fit within fewer partitions should reduce the RUs required.

@DaveAurionix
Author

Here's a challenge: we're doing this to try to fit micro-services into one DocumentDB partition each (10GB storage cap). We're doing that to reduce the RU/s cost below the 2500 minimum for scalable storage capacity. If we add a streams collection to help with the TTL concept then we have an RU overhead. It would be small though, so should still save RUs overall. I was trying to look into whether a cunning index on the commits collection could provide the information about expired streams without needing a new dedicated streams collection, but I didn't get very far.

asosMikeGore self-assigned this Nov 7, 2017
@asosMikeGore

Just started to work on some ideas here on a branch. If all we need to do is ensure that old events are deleted, then setting the collection level TTL should do the job. I've compared that to Event Store's TTL and it looks from the docs to work the same way.

This reduces complexity of setting the stream metadata. We can tackle that if a strong requirement to have stream specific TTLs comes up and then look at efficient ways to reduce any RU impacts for this.

Does that work for you?
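To make the semantics concrete, here's a rough Python model of what a collection-level TTL does (an illustration of the behaviour only, not Cosmos or SES code; `_ts` stands in for the document's last-write timestamp). The key property is that expiry is evaluated per document, not per stream.

```python
import time

def purge_expired(documents, ttl_seconds, now=None):
    """Rough model of Cosmos DB collection-level TTL: any document whose
    last write (_ts) is older than ttl_seconds is removed. Expiry is
    per document, so a stream can lose its older events while newer
    events remain."""
    now = time.time() if now is None else now
    return [d for d in documents if now - d["_ts"] <= ttl_seconds]

docs = [
    {"streamId": "order-1", "eventNumber": 1, "_ts": 0},
    {"streamId": "order-1", "eventNumber": 2, "_ts": 500},
]
# With a 100s TTL evaluated at t=550, only the newer event survives,
# leaving the stream partially deleted.
assert purge_expired(docs, ttl_seconds=100, now=550) == [docs[1]]
```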

@DaveAurionix
Author

DaveAurionix commented Nov 7, 2017

If I've understood you right ... I don't think collection-level TTL will work (on the commits collection anyway) - it will partially delete streams (i.e. delete older events but leave newer events in place). We need to either delete all events in a stream, or none of them - I think.

I had started to think too but this is very very draft as a different option:

  • A lazily updated index in CosmosDB based on streamid and persistence timestamp
  • Task DeleteExpiredStreams method on IStorageEngine (or storage-independent generalisation: Task PerformMaintenance?)
  • Queries for distinct streamid where timestamp > xx (this will query over commits but the distinct means only streamids will be returned)
  • Deletes any streams found using new deletestream stored procedure
  • Host process's responsibility to call PerformMaintenance from time to time (background thread?)

I might be wrong to be trying to avoid a "streams" collection, I haven't fed that thinking back into the points above yet. It might replace the index.
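The maintenance idea above might look roughly like this (a Python sketch of the logic only; in the real thing the scan would be a query over the commits collection plus the new deletestream stored procedure, and the function name is just the one proposed above):

```python
from collections import defaultdict

def delete_expired_streams(events, ttl_seconds, now):
    """Hypothetical PerformMaintenance sweep: find streams whose most
    recent event is older than the TTL, then delete those streams whole,
    never partially."""
    latest = defaultdict(float)
    for e in events:
        latest[e["streamId"]] = max(latest[e["streamId"]], e["_ts"])
    expired = {sid for sid, ts in latest.items() if now - ts > ttl_seconds}
    return [e for e in events if e["streamId"] not in expired]

events = [
    {"streamId": "a", "_ts": 0},
    {"streamId": "a", "_ts": 10},
    {"streamId": "b", "_ts": 990},
]
remaining = delete_expired_streams(events, ttl_seconds=100, now=1000)
assert {e["streamId"] for e in remaining} == {"b"}  # stream "a" gone entirely
```

Because expiry is judged on the stream's newest event, this behaves like a sliding stream-level TTL rather than the per-document expiry a collection TTL gives you.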

@asosMikeGore

Why aren't partial deletions acceptable? Would you still be trying to read any of the older streams beyond a certain point in time?

The simple way to avoid a streams collection is to have a "stream" document in the commits collection. That way you can apply changes in a transactionally consistent manner. However, this is engineering effort that I'd rather avoid if we can. What I'm proposing looks to be how this works in Event Store too, based on:
https://eventstore.org/docs/dotnet-api/4.0.0/stream-metadata/

@DaveAurionix
Author

Interesting thought that. I've realised the event store may not be used in an event sourcing context so I might be contaminating a pure store (SES) with a concern of a layer built on top of it (Event Sourcing).

I was thinking that partial deletions would be OK as long as they don't surface as partial event streams to callers - i.e. as long as it stays an implementation-level detail. If a caller innocently requests a certain event stream, I had thought that they should receive the whole stream or none of it. The event objects should exist for the same period of time that the final application state they represent exists for.

If, for example, the context is a DDD aggregate and it is replaying events to rebuild state, it would get in a horrible mess because it would only rebuild state partially. A decision it had earlier reached might be undone and a different action taken, simply because some events had gone missing from its stream.

I had thought that a stream should be append-only - I admit we're adding an exception by allowing the stream to be deleted, but at least we're not altering or corrupting the events inside a stream that still exists. If we were doing that, we'd probably need some sort of snapshot event capability before earlier events are deleted.

Adding a stream document in at the start is a neat solution - the TTL index would always cause that to be deleted with the first batch of events to be deleted. Potentially a stored procedure could be introduced to support reading from the stream, and that stored procedure could ensure that the stream document for the stream exists. If it doesn't, then the sproc returns no documents, even if a partial stream exists. That same principle could probably work without the stream document too - if the sproc makes sure that the document with id :1 always exists, for example.
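That read-path guard could be sketched as follows (illustrative Python only; the real check would live in a sproc or the C# read path, and `read_stream`/`StreamNotFoundError` are hypothetical names, assuming events are numbered from 1):

```python
class StreamNotFoundError(Exception):
    """Raised when a stream is absent or has been partially TTL-deleted."""

def read_stream(events_by_stream, stream_id):
    """Hypothetical read path: if the stream's first event has been
    TTL-deleted, treat the whole stream as gone rather than returning a
    partial replay to the caller."""
    events = sorted(events_by_stream.get(stream_id, []),
                    key=lambda e: e["eventNumber"])
    if not events or events[0]["eventNumber"] != 1:
        raise StreamNotFoundError(stream_id)
    return events

# Event 1 has expired, events 2-3 linger: the caller sees "not found",
# never a truncated stream.
store = {"order-1": [{"eventNumber": 2}, {"eventNumber": 3}]}
try:
    read_stream(store, "order-1")
    raise AssertionError("expected StreamNotFoundError")
except StreamNotFoundError:
    pass
```

This keeps partial deletion an implementation detail of the storage engine, which is the property argued for above.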

@DaveAurionix
Author

BTW .. I think I'd be reluctant to leave a stream metadata document behind (e.g. having IsStreamDeleted=true). I know it's a tiny amount of storage, but I like the concept that the TTL literally caps storage needs - it completely removes the concept of storage consumption that just keeps growing. Still thinking about that though.

@DaveAurionix
Author

DaveAurionix commented Nov 7, 2017

So ... summarising my current feeling - I think that if we have no need to reflect back on deleted streams' metadata, I'm currently in favour of the collection-level TTL that you mention, and a read sproc (or C# code) that returns no events to the caller if the first event has been deleted.

That allows partial deletes inside the storage engine, but doesn't surface them to the consumer.

@DaveAurionix
Author

DaveAurionix commented Nov 7, 2017

Do you have more of a need for the Stream document, and for consumers to check the IsStreamDeleted property before working with a stream's contents?

Pants, I never answered one of your questions: no, we shouldn't be requesting streams past the TTL point, but if some accident happens and we do request the stream, I'd like SES not to return it after it has expired, rather than application code trying to process a stream with early events missing. If events quietly start disappearing from streams, it might look on the surface as if SES has a bug, whereas when the whole stream goes it's more obvious that the "stream TTL" is responsible.

@asosMikeGore

If you commit to using a TTL, isn't it an intentional decision if events disappear? In terms of soft delete, we could represent this via a doc, but the simplest way to handle it would just be to put a "deleted" event onto the stream; that way you can process it via the change feed.

Pretty sure I've read of approaches before where people have soft deleted for X period of time until the events/streams aren't used and then hard delete them.

@DaveAurionix
Author

DaveAurionix commented Nov 7, 2017

At API level, I'd like to commit to a TTL on a stream so all events disappear, not a TTL on individual events so just some of them disappear (I'm remembering A.Finance.Fraud :)). A TTL that hard deletes is fine (i.e. best) for our need.

The above could be implemented using a TTL on the commits collection behind the scenes though, providing that implementation was invisible to the caller.

I don't have need of a stream metadata document, but I'm happy to check a Boolean on such a document if that is preferred by everyone. We would then make two API calls every time we loaded an event stream, though, so a single API method that returned both the stream metadata and the events would be good, to give an opportunity to avoid multiple calls to CosmosDB under the hood.

I didn't think this preference for a stream TTL over an event TTL was weird given the desired immutability of events within a stream, but if I'm alone in that view ignore me...

@DaveAurionix
Author

DaveAurionix commented Nov 7, 2017

Oh I get what you're saying ... we could soft delete ourselves using our own event, but then hard-delete with a TTL after that point in time. I still think it creates a strange unit testing demand on an aggregate, because we would need to test that it would not fail / have adverse side-effects when replaying event streams where the first parts of the stream had been deleted (lots of test cases too?), until it encountered the "soft delete" event. It's not as simple for the consumer as "whole stream has gone".

Also, the challenge would be storing the soft-delete event. We'd need a process that triggered/stored that event, e.g. x months after an interaction had stopped.

@DaveAurionix
Author

Sorry, thought about this some more. A hard-delete TTL on the commits collection (i.e. on events not on streams) would work fine for us exactly as you originally proposed it. We have no need of the stream metadata document at the moment but it wouldn't cause us any problems if you did want to add it.

We only retrieve entire streams so would detect a stream with a missing first event in our event sourcing library and surface it as stream-not-found to application code.

I think it is neater if SES had the concept of a TTL on the entire stream, but if that is not really a good concern for the event store then our event sourcing abstraction on top of the store would be where we supported it.

@asosMikeGore

I've pushed an early version of this out to the supportTTL branch. Any feedback would be great. I'd want to add a few more tests just to make sure stream reads & writes aren't impacted before merging this in - can't see why they would be based on the existing implementation though.

At present, the TTL would only be configured for new stores via the Initialize method - for existing event stores, we could handle this a few different ways:

  • Set outside of the API e.g. run a script to reconfigure.
  • Keep checking this and reconfigure as part of Initialize

Let me know what you think works.

@DaveAurionix
Author

The changes in that branch look perfect to me, thanks for that.

I'm happy to change TTL using a script outside of SES.

As we're checking if a collection exists on initialise, it doesn't sound more costly to instead retrieve the collection configuration and check it (then create the collection if it doesn't exist, or modify it if it does). However, we'd have to manage (or at least test) the race condition between multiple consumers starting at the same time, both trying to "correct" the TTL or other config if it has changed. Probably better to tackle that as a separate PR, and then only if we find config changing often?

@asosMikeGore

Agree with you that the Initialize shouldn't be trying to change the config once applied. We could potentially provide some scripts for convenience, but in this instance it feels a little bit like overkill as this is just a standard feature of Cosmos.

I'll add the additional tests and then push that into master. I'm aiming to do that by the end of the week at the latest.
