gapfind produces incorrect set of heights,tasks to fill #768

placer14 · 2021-11-18T05:04:10Z

Trying to reconcile an internal "lily data completeness" query, I believe I've identified some cases where gapfind does not detect holes and incorrectly produces holes which don't need filling.

Within chain/find.go you execute GapIndexer.findTaskEpochGaps w the following query (which I've refactored without adjusting the behavior for clarity with comments included inline indicating the logic bugs):

with
all_heights_and_tasks as (
	select height, task
    from visor_processing_reports
    where height between ? and ?
	group by 1, 2
)
, incomplete_heights as (
	select height, count(task)
	from all_heights_and_tasks
	group by 1
        -- this incorrectly handles NULL_ROUND heights which
        -- only have a single task (gap_find or consensus)
        -- PS: 14 comes from len(AllTasks) in the calling go function
	having count(task) != 14
)
, incomplete_heights_and_their_completed_tasks as (
	select ih.height, pr.task, pr.status_information
	from incomplete_heights ih
	left join visor_processing_reports pr using (height)

	-- this condition excludes NULL_ROUND tasks
	-- which are completed and should be included
	where pr.status_information != 'NULL_ROUND'

	-- what is this filter for?
	-- this will accidentally include rows `ih`
	-- which don't have a joined row on `pr` (which
	-- happens when there's no task history on those heights)
	-- creating heights with no completed tasks at all
	or pr.status_information is null
	group by 1, 2, 3
)
select height, task
from incomplete_heights_and_their_completed_tasks
order by height desc
;

The text was updated successfully, but these errors were encountered:

hsanjuan · 2021-11-18T08:44:58Z

-- this will accidentally include rows ih
-- which don't have a joined row on pr (which
-- happens when there's no task history on those heights)

I think all rows on ih have a row in pr or they would no be on ih.

The main thing is that if filters out certain (height, task) because it assumes that a status_information == null means OK but there are also errors that have null status_information.

But yeah, otherwise the query is very fragile and I think it works in most cases by chance (count != 14) and not by design.

placer14 · 2021-11-18T14:06:42Z

I've rewritten this without trying to place a cludgy fix over it.

this speeds up execution (casual observations suggest around 10x improvement)
made the query inputs a little more obvious
removed bugs around edges cases
this includes the diffing logic within the query which was previously happening after the query in go
this logic appears to be complete according to spot checks in both directions (ensure this query points at heights that actually contain gaps as well as gaps manually perceived in visor reports which appear in the results of the query)

There are some tests which this change is failing and I have some additions to include.

frrist · 2021-11-22T18:41:13Z

Based on some conversation here is my understanding of the current state of this bug.

Accuray Query used to find gaps in data for grafana dashboards that produces different results than the internal gap find query from lily:

with reports as ( --selects all heights that should have task reports
  select height, task, status from visor_processing_reports
  where height between unix_to_height($__unixEpochFrom()) and unix_to_height($__unixEpochTo())
    and task != 'gap_find'
    and status != 'INFO'
),

tasks as ( -- selects all tasks in the reports
  select task, count(task) as task_count from reports group by task
),

report_tasks as ( -- makes a table with a height entry for every known task
  select a.height, b.task from (select height from reports group by height) a
  join tasks b on true
),

noreports as ( -- height, task, 1 entries whenever a height does not have a corresponding task entry - no OK, no SKIP, no ERROR.
  select rt.height, rt.task, 1 as count from report_tasks rt
  left join reports r on rt.height = r.height and rt.task = r.task
  where r.height is  NULL
),

oks as (
  select height, task from reports where status = 'OK'
  group by height, task
),

skips as (
  select height, task from reports where status = 'SKIP'
  group by height, task
),

errors as (
  select height, task from reports where status = 'ERROR'
  group by height, task
),

missing_skip as ( -- tasks that have SKIP but not OK for every height
  select
    sk.height, sk.task, count(sk.task)
  from skips sk
  left join oks ok on ok.height = sk.height and ok.task = sk.task
  where ok.height is NULL
  group by sk.height, sk.task
),

missing_error as ( -- tasks that have error but not OK for every height
  select
    er.height, er.task, count(er.task)
  from errors er
  left join oks ok on ok.height = er.height and ok.task = er.task
  where ok.height is NULL
  group by er.height, er.task
),

missing_all as ( -- put all the above together.
  select r.height, r.task, coalesce(mi.count,0) as count
  from report_tasks r
  left join (
   select * from missing_skip
   union
   select * from missing_error
   union
   select * from noreports
  ) mi on r.height = mi.height and r.task = mi.task
)

select time_bucket('$sum_bucket', to_timestamp(height_to_unix(height))) as time,
  task, sum(count) as "sum"
from missing_all
group by time, task
order by time

The missing_error portion of the query is what will cause some of the diff since lily's gap fill doesn't treat errors as gap:

missing_error as ( -- tasks that have error but not OK for every height
  select
    er.height, er.task, count(er.task)
  from errors er
  left join oks ok on ok.height = er.height and ok.task = er.task
  where ok.height is NULL
  group by er.height, er.task
),

placer14 · 2021-11-23T16:32:54Z

There was some misunderstanding on my part around the behavior of this query for epoch-task reports which have an ERRORing status. Specifically, the existing behavior was omitting ERRORs on purpose (with the thinking that they would never resolve to anything other than ERROR) which were the only false negatives I could identify. There were also some false positives identified and being tracked in #773. This seems to be the extent of the problems I can see.

The fall out from this is at least a faster, refactored query and some questions to answer about how we want to handle ERROR cases for find/fill (as well as some discussion for how to represent NULL_ROUNDs).

FWIW, I’ve noticed that there are processing records with ERRORs which show as filled in the gap reports suggesting that errors could be transient and are worth retrying. (the mechanism of which can be discussed)

placer14 · 2021-11-24T13:34:11Z

Closing this issue, as it is being superseded by #773 and #775 with more specificity around the actual bugs indicated above.

placer14 added the kind/bug Kind: Bug label Nov 18, 2021

placer14 self-assigned this Nov 18, 2021

placer14 mentioned this issue Nov 18, 2021

Simplify gapfind hole detection #769

Closed

placer14 closed this as completed Nov 24, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gapfind produces incorrect set of heights,tasks to fill #768

gapfind produces incorrect set of heights,tasks to fill #768

placer14 commented Nov 18, 2021

hsanjuan commented Nov 18, 2021

placer14 commented Nov 18, 2021

frrist commented Nov 22, 2021 •

edited

Loading

placer14 commented Nov 23, 2021 •

edited

Loading

placer14 commented Nov 24, 2021

gapfind produces incorrect set of heights,tasks to fill #768

gapfind produces incorrect set of heights,tasks to fill #768

Comments

placer14 commented Nov 18, 2021

hsanjuan commented Nov 18, 2021

placer14 commented Nov 18, 2021

frrist commented Nov 22, 2021 • edited Loading

placer14 commented Nov 23, 2021 • edited Loading

placer14 commented Nov 24, 2021

frrist commented Nov 22, 2021 •

edited

Loading

placer14 commented Nov 23, 2021 •

edited

Loading