Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TipSetIndexer task result persistence race condition #374

Closed
frrist opened this issue Jan 26, 2021 · 2 comments · Fixed by #370 or #396
Closed

TipSetIndexer task result persistence race condition #374

frrist opened this issue Jan 26, 2021 · 2 comments · Fixed by #370 or #396
Labels
kind/bug Kind: Bug status/moreinfo More information is required to progress the issue

Comments

@frrist
Copy link
Member

frrist commented Jan 26, 2021

Describe the bug:

The TipSetIndexer's TipSet method may return before all task outputs have been persisted to storage. Its usage in WalkChain can lead to Walker's Run method returning before persistence has completed and can therefore cause Visor to exit before all extracted data has been persisted.

go routine responsible for persisting task outputs: https://github.com/filecoin-project/sentinel-visor/blob/dcc3c5769986613c465a67e9d54b0d11a559ceda/chain/indexer.go#L266-L293

I think the fix could be as simple as moving the WaitGroup instantiation calls and wg.Wait() call outside of the go routine -- see 9fbd5a9 for details

Steps to Reproduce:

I ran into this while in my vector bench marking branch frrist/vector-builder.
To Repo using the aforementioned branch:

  1. ensure you have a lotus datastore at ~/.lotus
  2. run visor --log-level=info vector build --from=446000 --to=446010 --tasks=blocks,messages --vector-file=./visor-vectors/vector.json
  3. run ./visor --log-level=info vector execute --vector-file=./visor-vectors/vector.json
  4. repeat step 3 until you get a panic... race conditions are fun

Visor Version: dcc3c57

@frrist frrist added the kind/bug Kind: Bug label Jan 26, 2021
frrist added a commit that referenced this issue Jan 26, 2021
@iand
Copy link
Contributor

iand commented Jan 27, 2021

I don't think your suggested fix is correct. The TipSet method is designed to return before persistance is complete: the goal is to persist the results of an extraction while the next tipset extraction is executing. We then have 30 seconds to extract plus 30 seconds to persist (which is mostly IO or idle wait time). Your change would mean the extraction and persistance have to happen within the 30 second epoch of a single tipset.

t.persistSlot acts as a semaphore so that only one extraction + persistance pair is in flight at any one time.

Can you add the panic backtrace to the issue?

@iand iand added the status/moreinfo More information is required to progress the issue label Feb 10, 2021
frrist added a commit that referenced this issue Feb 18, 2021
frrist added a commit that referenced this issue Feb 18, 2021
This reverts commit 85eb065df81f0ab2688530772c72f078344c9bab.
frrist added a commit that referenced this issue Feb 18, 2021
- will revert after a proper fix is designed.
@frrist
Copy link
Member Author

frrist commented Feb 19, 2021

reopening as the current fix his a hack

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Kind: Bug status/moreinfo More information is required to progress the issue
Projects
None yet
2 participants