Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a cluster_by_episode() operation on date series #2198

Open
evansd opened this issue Oct 31, 2024 · 0 comments
Open

Add a cluster_by_episode() operation on date series #2198

evansd opened this issue Oct 31, 2024 · 0 comments

Comments

@evansd
Copy link
Contributor

evansd commented Oct 31, 2024

The intention here is to provide a more flexible, composable alternative to the existing count_episodes_for_patient aggregation.

The idea is that given a date event series like this:

date
2020-01-01
2020-01-04
2020-01-06
2020-01-10
2020-01-12

We could "cluster" it into episodes, defining the maximum gap allowed between episodes like so:

date.cluster_by_episode(days(3))

And this would produce a new event frame with two columns giving the start and end dates of each episode:

start_date end_date
2020-01-01 2020-01-06
2020-01-10 2020-01-12

Calling count_for_patient() on this frame would give the same result as the existing count_episodes_for_patient() aggregation. But all sorts of other operation are now possible e.g. calculating total episode length and including only those over a certain length:

episodes = events.date.cluster_by_episode(days(3))
episode_lengths = episodes.end_date - episode.start_date
long_episodes = episodes.where(episode_lengths.days >= 14)

Original Slack discussion with motivating use cases:
https://bennettoxford.slack.com/archives/C01D7H9LYKB/p1729785637716699

Implementation notes

This is a new kind of operation in the sense that it takes a series and returns a frame. It's also unusual in that the frame class it returns is not one already defined in a schema module. In the query_language module we'll probably want to define a dedicated frame class for this with start_date and end_date series.

Note that this operation is not an aggregation (it doesn't result in a one-row-per-patient result) so we shouldn't use the _for_patient() suffix in naming it.

I don't have a clear sense of how fiddly or not this will be to implement in SQL. The existing episode counting code uses a LAG window function to do it's work. It may be that a simple adaption of this code will do what we need. Or there may turn out to be some significant blocker I haven't thought of.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant