Terminate PG query when task is killed via Airflow #717
Conversation
This looks really great so far! The approach seems sound to me, although I agree this will be a tricky one to test.
I tried testing this by just adding a `SELECT pg_sleep(120)` to the `create_loading_table` task, and adjusting its execution timeout to 10 seconds. When I ran a provider script locally, though, the task hung for the full 120 seconds and only then raised a Timeout error 🤔 Are you seeing this too?
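For concreteness, the local test described above would look something like this sketch (the `pg_sleep` statement and 10-second timeout come from the comment; the operator wiring is assumed, not copied from the PR):

```python
from datetime import timedelta

# SQL that holds the Postgres backend busy far longer than the task timeout.
SLEEP_SQL = "SELECT pg_sleep(120);"

# The execution timeout used in the test described above.
EXECUTION_TIMEOUT = timedelta(seconds=10)

# Wiring sketch (commented out to keep the snippet self-contained):
# create_loading_table = PostgresOperator(
#     task_id="create_loading_table",
#     sql=SLEEP_SQL,
#     execution_timeout=EXECUTION_TIMEOUT,
# )
```

Note that killing the Airflow task alone doesn't stop the server-side `pg_sleep`, which is exactly the gap this PR addresses.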
I wouldn't want to kill a long-running db job that just needs to run, but I would want to be able to kill other long-running things that might get in the first one's way.
Sorry, I didn't quite get this -- can you give an example of what you mean here?
Thanks @stacimc ! Airflow log: … DB jobs query: …
Got it, thank you! And I think you're right, we'll want to make sure we're only killing jobs associated with the failed task. I checked out the temp commit you pushed to test a long-running query on ….

I did a little bit of digging and got as far as finding a way of getting the timeouts to work in PostgresOperators, by passing in a …. When I run that, the ….

I don't know how helpful this is, but maybe something to look into?
Thanks so much @stacimc ! I have code in a temporary test DAG that I just committed that works for both setting the database ….

So to the extent that my goal was to learn more about the broad array of possible ways to get this done, I feel pretty accomplished. 🤣 But I'm still not 100% sure what would be the best thing to do for Openverse. Here are some notes based on the different ways that the catalog is running Postgres jobs, and what I'm learning about these different kill-the-job design options.
So, yeah, I think that my next steps / open questions are:
Thanks again!
This is such an interesting PR, @rwidom! I am learning a lot about how this works 😄 I love that you've added the temporary test DAG; that makes it so easy to play around with. I have a bunch of thoughts, apologies for not organizing them super well. Let me know if this makes sense!

Commoncrawl

For psycopg2, it looks like that's only used in the crawler, as you've pointed out. The crawler hasn't been touched in a long time and is not running, and to my knowledge it hasn't been prioritized for any near term. I think we can avoid spending too much time on this part, but it's awesome that you've gotten it working!

Custom operators

As I understand what you've said: with …
Just for the sake of discussion 😄, have you thought about making a custom operator for …?

Approach/scope

I think the way we've been writing these, it's safe to assume the …
Good question! Your idea that we'll need to be careful to kill only the query associated with the timed-out task makes intuitive sense to me, but I'm really not very familiar with the database internals. Pinging @AetherUnbound for your thoughts on this especially 🧠 I do like that this approach lets you look up the specific query to kill using nothing but the task context. If we end up making our own custom operator, maybe there's something we can hook into to do clean-up on timeout? I only looked briefly at the source; maybe `on_kill`?
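To make the clean-up idea concrete, here's a hedged sketch (the helper name and the `application_name` tagging convention are assumptions, not code from this PR) of the query an operator's `on_kill` could run to terminate only its own backends:

```python
def build_terminate_sql(application_name: str) -> str:
    """Build a query that terminates only the backends tagged with this
    task's application_name, sparing the control connection issuing it."""
    return (
        "SELECT pg_terminate_backend(pid) FROM pg_stat_activity "
        f"WHERE application_name = '{application_name}' "
        "AND pid <> pg_backend_pid();"
    )
```

A custom operator could set `application_name` on its connection from the task context, then run this from `on_kill` over a fresh connection.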
Another good question 😄 I think this would be useful just about everywhere, certainly in the loader steps of the provider DAGs. It's totally reasonable for this to be split across multiple PRs, however! And as I said, the commoncrawl steps can be deprioritized since that's not running and we're not able to test it.
Ooh! I didn't know about …!
Sorry it took me so long to get to this! I love the deep dive you've done @rwidom and I've learned a bunch 🚀 The …

One thing I'm getting caught up on as you're describing this:

This doesn't quite match my understanding of what Airflow is capable of, do you mind elaborating on this a bit? At least with the ….

If we wanted other attributes (e.g. …) ….

Personally, I think that the best way to go would be ….

Heck, we may even want a ….
Hey @AetherUnbound, I was just thinking about returning to this. Sorry for the delay! Is it still high priority? Does #939 make it irrelevant now? I am more familiar with XComs and parameters now, so I might be better equipped to finish this up, but I'm just trying to figure out if that makes sense right now.
I don't think this is made irrelevant by #939, but the diff might look a bit different because of the dependency changes there. I think terminating the query is still important and so could still be high priority, but I could also see it being moved down to medium since we haven't encountered this particular issue recently 🤔 What do you think @WordPress/openverse-catalog?
OK, I don't know if it's just that I haven't had enough real focus time on this, or that I'm really missing the boat on some basic Airflow concepts, or what, but here's where I think things are...
So, I've been doing a lot of thinking in circles, and not making a ton of real progress, but now you have the latest @AetherUnbound and @stacimc . Help?
@rwidom and I had a quick synchronous discussion where we talked about the various possibilities for this endeavor and what might make the most sense moving forward. She prepared this excellent doc describing what possibilities we have. Here are the outcomes of that discussion:
I also shared this pseudo-code with Rebecca for what the hook might look like, based off of her initial code:

```python
class TimedPostgresHook(PostgresHook):
    """
    PostgresHook that sets the database timeout on any query to match the airflow
    task execution timeout.
    """

    def __init__(self, *args, default_statement_timeout: str = None, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.default_statement_timeout = default_statement_timeout

    def get_pg_timeout_sql(self, statement_timeout: str) -> str:
        return f"SET statement_timeout TO '{statement_timeout}s';"

    def run(self, sql, statement_timeout: str = None, autocommit=False, parameters=None, handler=None):
        statement_timeout = statement_timeout or self.default_statement_timeout
        if statement_timeout:
            # Prepend the timeout as its own statement rather than concatenating
            # it onto the (possibly list-valued) sql argument.
            sql = [self.get_pg_timeout_sql(statement_timeout), *(sql if isinstance(sql, list) else [sql])]
        super().run(sql, autocommit=autocommit, parameters=parameters, handler=handler)
```

I think that covers everything we discussed, but Rebecca please chime in if I missed anything 😄
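One detail the pseudo-code leaves open is how the string timeout gets derived from the task's `execution_timeout` (a `timedelta`). A minimal sketch, with the helper name assumed:

```python
from datetime import timedelta
from typing import Optional


def pg_timeout_value(execution_timeout: Optional[timedelta]) -> Optional[str]:
    # Hypothetical helper: convert an Airflow execution_timeout into the
    # whole-seconds string interpolated into "SET statement_timeout TO '{}s';".
    if execution_timeout is None:
        return None
    return str(int(execution_timeout.total_seconds()))
```

For example, `pg_timeout_value(timedelta(hours=1))` yields `"3600"`, which the hook would render as `SET statement_timeout TO '3600s';`.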
Thanks so much @AetherUnbound ! That's perfect! One more thought on naming: I have these things in a file called `sql_helpers`, but I think `sql_timeout` would probably be clearer. Thoughts?
Co-authored-by: Staci Mullins <[email protected]>
Hello! I think I responded above about the timeouts stacking, but yes, it's something to be aware of. It's really per SQL statement, not per task, and a task can have multiple SQL statements. But maybe future PR(s) can add ….

Also, yes, thanks so much for catching those …!
For now, I did a lot of switching to kwargs in the function calls, mainly because it seemed like the path of least resistance and most clarity, but let me know if I should be doing something else instead.
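To illustrate the stacking point above (the numbers and helper name are illustrative only): because `statement_timeout` caps each statement separately, a task running several statements in one session can take up to the timeout multiplied by the statement count.

```python
def worst_case_task_runtime_s(statement_timeout_s: int, n_statements: int) -> int:
    # statement_timeout is enforced per statement, not per session, so the
    # worst case for a task is every statement running right up to the cap.
    return statement_timeout_s * n_statements
```

So a 10-second statement timeout over a three-statement task still allows roughly 30 seconds of total runtime.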
Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR: @krysal

Excluding weekend days, this PR was updated 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s). @rwidom, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
I think I'd still prefer the ….
My only concern is that I have already done a bunch (if not all, I think all) of those changes in the function calls and tests, so I think I would need to go back through all of those changes and change them again. Also, I'm not entirely sure how we want the error handling to work if the task is `None`. Just set the default task timeout and act like everything is OK? Add some info to the log? Raise an error? Like, how important is it that the task timeouts we're setting actually get implemented (setting aside stacking concerns) at the level of detail we're setting? If this were only Airflow doing the passing, then I would be less concerned, but there are situations where we have to pass the task more than once (i.e. through multiple function calls for the same step in a DAG), so just listing the task in the kwargs won't make it automagically appear everywhere it needs to be.
These were removed in #985 but were actually still being issued; I had just written a very poor filter that apparently captured everything...
Per an offline discussion that we had, I took on the effort of moving the function parameters around so that the ….

I'm going to try to run a bunch of DAGs locally just to be sure, but I think this should be good to go now!
Thanks so much Rebecca for all your effort with this one! Can't wait to have it in 😄
LGTM! Lowered the ingestion limit and ran just about all of our DAGs including the data refreshes. The changes to the parameters look great @AetherUnbound 💯
Thanks so much again for your incredible work here @rwidom! I'm so excited to see this in 🎉
This just improves testing speeds for most test runs; full suite runs will happen in CI/CD.
Thank you so much for doing this last piece @AetherUnbound !!! This is a really small nit, but I think that technically tasks can be either ….
Ugh, I just started to make that change myself and got annoyed with myself about navigating the imports and wondering: is …?
I think the `AbstractOperator` typing looks great! This looks ready to me - merge is "go" on your command 🚀😁
Fixes
Fixes WordPress/openverse#1455 by @AetherUnbound
Description
This summarizes a lot of discussion (and help from @stacimc and @AetherUnbound!):
The new hook runs `SET statement_timeout TO '{execution_timeout}s';` before any other SQL, so that Postgres knows when to kill long-running queries.

Testing Instructions
Switch to the branch, run `just test`, and then run some DAGs. I ran the Cleveland Museum and iNaturalist (with small test files) locally and it went fine.