gapfind produces incorrect set of heights,tasks to fill #768
Comments
I think all rows on … The main thing is that it filters out certain (height, task) pairs because it assumes that status_information == null means OK, but there are also errors that have a null status_information. Otherwise, the query is very fragile, and I think it works in most cases by chance (count != 14) and not by design.
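To make the null status_information point concrete, here is a minimal Python sketch. The rows and the helper names (`ok_by_null_info`, `ok_by_status`) are hypothetical and are not taken from lily; only the column names `status` and `status_information` come from the discussion above.

```python
# Hypothetical processing-report rows: (height, task, status, status_information).
# The data is invented purely for illustration.
reports = [
    (100, "messages", "OK", None),
    (100, "blocks", "ERROR", None),       # an error that carries no status_information
    (101, "messages", "ERROR", "timeout"),
    (101, "blocks", "OK", None),
]


def ok_by_null_info(rows):
    """Fragile rule: treat any row with a null status_information as OK."""
    return {(h, t) for h, t, _status, info in rows if info is None}


def ok_by_status(rows):
    """Explicit rule: only rows whose status is 'OK' count as OK."""
    return {(h, t) for h, t, status, _info in rows if status == "OK"}


# (height, task) pairs that the fragile rule wrongly considers complete.
misclassified = ok_by_null_info(reports) - ok_by_status(reports)
print(sorted(misclassified))
```

Under these assumed rows, the ERROR at height 100 for the `blocks` task is counted as complete by the null-based rule, so it would never be reported as a gap.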
I've rewritten this without trying to place a kludgy fix over it.
This change is failing some tests, and I have some additions to include.
Based on some conversation, here is my understanding of the current state of this bug.
Accuracy query used to find gaps in data for the Grafana dashboards, which produces different results than the internal gap find query from lily:
with reports as ( -- selects all heights that should have task reports
select height, task, status from visor_processing_reports
where height between unix_to_height($__unixEpochFrom()) and unix_to_height($__unixEpochTo())
and task != 'gap_find'
and status != 'INFO'
),
tasks as ( -- selects all tasks in the reports
select task, count(task) as task_count from reports group by task
),
report_tasks as ( -- makes a table with a height entry for every known task
select a.height, b.task from (select height from reports group by height) a
join tasks b on true
),
noreports as ( -- height, task, 1 entries whenever a height does not have a corresponding task entry - no OK, no SKIP, no ERROR.
select rt.height, rt.task, 1 as count from report_tasks rt
left join reports r on rt.height = r.height and rt.task = r.task
where r.height is NULL
),
oks as (
select height, task from reports where status = 'OK'
group by height, task
),
skips as (
select height, task from reports where status = 'SKIP'
group by height, task
),
errors as (
select height, task from reports where status = 'ERROR'
group by height, task
),
missing_skip as ( -- tasks that have SKIP but not OK for every height
select
sk.height, sk.task, count(sk.task)
from skips sk
left join oks ok on ok.height = sk.height and ok.task = sk.task
where ok.height is NULL
group by sk.height, sk.task
),
missing_error as ( -- tasks that have error but not OK for every height
select
er.height, er.task, count(er.task)
from errors er
left join oks ok on ok.height = er.height and ok.task = er.task
where ok.height is NULL
group by er.height, er.task
),
missing_all as ( -- put all the above together.
select r.height, r.task, coalesce(mi.count,0) as count
from report_tasks r
left join (
select * from missing_skip
union
select * from missing_error
union
select * from noreports
) mi on r.height = mi.height and r.task = mi.task
)
select time_bucket('$sum_bucket', to_timestamp(height_to_unix(height))) as time,
task, sum(count) as "sum"
from missing_all
group by time, task
order by time

The missing_error portion of the query is what will cause some of the diff, since lily's gap fill doesn't treat errors as gaps:
missing_error as ( -- tasks that have error but not OK for every height
select
er.height, er.task, count(er.task)
from errors er
left join oks ok on ok.height = er.height and ok.task = er.task
where ok.height is NULL
group by er.height, er.task
),
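The effect of the missing_error term can be sketched on a toy table. This is a simplified illustration using Python's built-in sqlite3 module, not the dashboard query or lily's query: the table contents are invented, the noreports case is left out, and the "no error gaps" rule is only an assumption that drops the error term while keeping everything else the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table reports (height integer, task text, status text)")
conn.executemany(
    "insert into reports values (?, ?, ?)",
    [
        (1, "blocks", "OK"),
        (1, "messages", "ERROR"),  # ERROR that never received an OK
        (2, "blocks", "ERROR"),
        (2, "blocks", "OK"),       # ERROR later filled by an OK
        (2, "messages", "SKIP"),   # SKIP that never received an OK
    ],
)

# (height, task) pairs that have a report with the given status
# but no OK report for the same (height, task).
NOT_OK = """
select distinct r.height, r.task
from reports r
left join reports ok
  on ok.height = r.height and ok.task = r.task and ok.status = 'OK'
where r.status = ? and ok.height is null
"""

skip_gaps = set(conn.execute(NOT_OK, ("SKIP",)).fetchall())
error_gaps = set(conn.execute(NOT_OK, ("ERROR",)).fetchall())

dashboard_gaps = skip_gaps | error_gaps  # dashboard rule: ERROR without OK is a gap
no_error_gaps = skip_gaps                # assumed rule: ERROR is not a gap

# Pairs reported by the dashboard rule only.
print(sorted(dashboard_gaps - no_error_gaps))
```

With this data, the only pair that differs between the two rules is the ERROR without a later OK, which is the kind of row expected to show up in the diff between the two queries.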
There was some misunderstanding on my part around the behavior of this query for epoch-task reports which have an …
The fallout from this is at least a faster, refactored query and some questions to answer about how we want to handle …
FWIW, I’ve noticed that there are processing records with ERRORs which show as filled in the gap reports, suggesting that errors could be transient and are worth retrying (the mechanism for which can be discussed).
Trying to reconcile an internal "lily data completeness" query, I believe I've identified some cases where gapfind does not detect holes and incorrectly produces holes which don't need filling.
Within chain/find.go you execute GapIndexer.findTaskEpochGaps with the following query (which I've refactored for clarity without adjusting the behavior, with comments included inline indicating the logic bugs):