Some cron job triggered appear to be running indefinitely #1704

Open
BenjaminDecreusefond opened this issue Jan 15, 2025 · 4 comments
Labels
bug Something isn't working

Comments

@BenjaminDecreusefond
Contributor

Bug description 🐞

This is probably an obscure bug, but sometimes when a cron job is triggered it seems to run indefinitely and never stops.
Screenshot 2025-01-15 at 10 18 01
Screenshot 2025-01-15 at 10 18 24
Let me know if you need any more info!

Steps to reproduce

Sorry, I can't provide steps to reproduce: the behavior seems random to me, as you can see from the screenshots above.

Expected behavior

The job should stop at some point.

Example repository

No response

Anything else?

No response

@BenjaminDecreusefond added the bug label Jan 15, 2025
@alfespa17
Member

There is a job that cancels long-running executions once a job has been running for more than 6 hours, here:

log.error("Job has been running for more than 6 hours, cancelling running job {}", job.getId());

Check the logs; maybe you can find something related.
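
For readers who want to reason about which jobs this check would catch, here is a minimal sketch of a 6-hour cutoff. It is not Terrakube's actual implementation: `MAX_RUNTIME`, `deadline_for`, and `is_overdue` are hypothetical names, and the sketch assumes the deadline is simply the job's creation time plus 6 hours. The timestamps mirror those reported for job 32 later in this thread.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the limit is 6 hours, as stated in the comment above.
MAX_RUNTIME = timedelta(hours=6)


def deadline_for(created_at: datetime) -> datetime:
    """Latest time by which a job created at `created_at` should have finished."""
    return created_at + MAX_RUNTIME


def is_overdue(created_at: datetime, now: datetime) -> bool:
    """True when the job has exceeded the maximum runtime."""
    return now > deadline_for(created_at)


# Illustrative timestamps, taken from the job 32 log in this thread.
created = datetime(2024, 10, 22, 10, 4, 28, tzinfo=timezone.utc)
now = datetime(2025, 1, 23, 13, 50, 0, tzinfo=timezone.utc)

print(deadline_for(created).isoformat())  # 2024-10-22T16:04:28+00:00
print(is_overdue(created, now))           # True
```

Under that assumption, a job still marked as running months after its deadline suggests the cancellation itself is failing, which is what the logs would confirm.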

@BenjaminDecreusefond
Contributor Author

Hi!

I think I found something that might be related! After digging into the DB, I realized that some rows were corrupted in a way that prevented Terrakube from processing some jobs and even made some crash. Sometimes (I don't know the exact reason) the field terraform_plan of relation job was empty, which caused the following exception:


Inactive Job 32 should be completed before Tue Oct 22 16:04:28 UTC 2024, current time Thu Jan 23 13:50:00 UTC 2025
2025-01-23T13:50:00.358Z ERROR 1 --- [ryBean_Worker-7] o.t.a.p.scheduler.inactive.InactiveJobs  : Job has been running for more than 6 hours, cancelling running job 32
2025-01-23T13:50:00.359Z  WARN 1 --- [ryBean_Worker-7] o.t.a.p.scheduler.inactive.InactiveJobs  : Cancelling pending steps
2025-01-23T13:50:00.363Z  INFO 1 --- [ryBean_Worker-7] o.t.a.p.scheduler.inactive.InactiveJobs  : No information to update for job
2025-01-23T13:50:00.369Z  WARN 1 --- [ryBean_Worker-7] o.h.engine.jdbc.spi.SqlExceptionHelper   : SQL Error: 0, SQLState: 23502
2025-01-23T13:50:00.369Z ERROR 1 --- [ryBean_Worker-7] o.h.engine.jdbc.spi.SqlExceptionHelper   : ERROR: null value in column "workspace_id" of relation "job" violates not-null constraint
Detail: Failing row contains (32, failed, null, 179acb9d-5119-49db-a7b6-ed3aeca88daa, null, 2024-10-22 10:04:28.547, 2025-01-23 13:50:00.363, [email protected], Internal, null, ZmxvdzoKICAtIHR5cGU6ICJ0ZXJyYWZvcm1QbGFuIgogICAgc3RlcDogMTAwCiAg..., null, 9163e61b-b12e-4f72-a3f5-5eb5918baaa6, null, null, UI, null, null, t, f, f, t).
2025-01-23T13:50:00.378Z ERROR 1 --- [ryBean_Worker-7] org.quartz.core.JobRunShell              : Job DEFAULT.TerrakubeV2_InactiveJobs threw an unhandled Exception:

org.springframework.dao.DataIntegrityViolationException: could not execute statement [ERROR: null value in column "workspace_id" of relation "job" violates not-null constraint
Detail: Failing row contains (32, failed, null, 179acb9d-5119-49db-a7b6-ed3aeca88daa, null, 2024-10-22 10:04:28.547, 2025-01-23 13:50:00.363, [email protected], Internal, null, ZmxvdzoKICAtIHR5cGU6ICJ0ZXJyYWZvcm1QbGFuIgogICAgc3RlcDogMTAwCiAg..., null, 9163e61b-b12e-4f72-a3f5-5eb5918baaa6, null, null, UI, null, null, t, f, f, t).] [update job set approval_team=?,auto_apply=?,comments=?,commit_id=?,created_by=?,organization_id=?,output=?,override_branch=?,override_source=?,plan_changes=?,refresh=?,refresh_only=?,status=?,tcl=?,template_reference=?,terraform_plan=?,updated_by=?,updated_date=?,via=?,workspace_id=? where id=?]; SQL [update job set approval_team=?,auto_apply=?,comments=?,commit_id=?,created_by=?,organization_id=?,output=?,override_branch=?,override_source=?,plan_changes=?,refresh=?,refresh_only=?,status=?,tcl=?,template_reference=?,terraform_plan=?,updated_by=?,updated_date=?,via=?,workspace_id=? where id=?]; constraint [workspace_id" of relation "job]

After deleting all related rows in relation step:

DELETE FROM step 
WHERE job_id IN (SELECT id FROM job WHERE terraform_plan IS NULL);

and running

delete from job where terraform_plan is null;

The problem seems to be solved and we no longer get that error!

I don't know if this helps, but at least the issue is documented here for others who encounter it! :)

@alfespa17
Member

I think the correct script for ERROR: null value in column "workspace_id" of relation "job" violates not-null constraint should be:

Script 1

DELETE FROM step 
WHERE job_id IN (SELECT id FROM job WHERE workspace_id IS NULL);

Script 2

delete from job where workspace_id is null;
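
The order of the two scripts matters if step.job_id references job.id, which the step-first ordering above suggests but the thread does not confirm. A rough sketch of running both deletes in one transaction follows, using an in-memory SQLite database as a stand-in for the real PostgreSQL one. The two-table schema is a hypothetical minimum, and `delete_jobs_with_null` is an illustrative helper, not part of Terrakube.

```python
import sqlite3

# Hypothetical minimal schema: only the columns used by the scripts above.
SCHEMA = """
CREATE TABLE job (id INTEGER PRIMARY KEY, workspace_id TEXT, terraform_plan TEXT);
CREATE TABLE step (
    id INTEGER PRIMARY KEY,
    job_id INTEGER NOT NULL REFERENCES job(id)
);
"""

# Column names cannot be bound as SQL parameters, so restrict them explicitly.
ALLOWED_COLUMNS = {"workspace_id", "terraform_plan"}


def delete_jobs_with_null(conn: sqlite3.Connection, column: str) -> int:
    """Delete jobs whose `column` is NULL, plus their steps; return jobs removed."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unexpected column: {column}")
    with conn:  # commits on success, rolls back if either statement fails
        # Child rows first, otherwise the foreign key blocks the job delete.
        conn.execute(
            f"DELETE FROM step WHERE job_id IN (SELECT id FROM job WHERE {column} IS NULL)"
        )
        cur = conn.execute(f"DELETE FROM job WHERE {column} IS NULL")
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript(SCHEMA)
conn.execute("INSERT INTO job VALUES (1, 'ws-a', 'plan'), (2, 'ws-b', NULL)")
conn.execute("INSERT INTO step VALUES (10, 1), (20, 2), (21, 2)")
conn.commit()

removed = delete_jobs_with_null(conn, "terraform_plan")
print(removed)                                         # 1
print(conn.execute("SELECT id FROM job").fetchall())   # [(1,)]
print(conn.execute("SELECT id FROM step").fetchall())  # [(10,)]
```

Wrapping both statements in a single transaction means a failure in the second delete leaves the step rows intact. Against a real database, running the inner SELECT on its own first shows which jobs would be removed.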

@BenjaminDecreusefond
Contributor Author

Strangely, when I looked at workspace_id, the rows all seemed to have a value! However, terraformPlan was empty, so I went with that column!
