The document that was processed by a stage when it crashed is left unhandled in the pipeline #328

Open
ssimon opened this issue Apr 10, 2014 · 4 comments

Comments

@ssimon

ssimon commented Apr 10, 2014

One solution would be to let Core check documents when they are returned from the cache. Any document that was fetched but not touched when returned should be considered failed, rather than just having its status reset, since the document itself may be what caused the stage to crash.

@laserval
Contributor

Some background:
A problem in pipelines with unstable stages is that documents may be fetched by a stage, which then crashes. This will cause the document to remain in a fetched state with no matching touched state, forever. Related discussion: #307 (unrecoverable stage failure will cause the same issues)

Since Core doesn't keep track of who fetched what document (that information is stored on the document level), there is nothing that checks if this state of fetched-not-touched has occurred.
Core has a chance to check document state in one case: when flushing documents to the backing database (all documents pass through flush()).
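To make the flush-time check concrete, here is a minimal sketch. `CachedDocument`, `fetchedBy`, `touchedBy` and `failOrphans` are hypothetical names made up for illustration, not Hydra's actual classes or cache API; the only assumption is that per-stage fetched and touched markers are stored on the document, as described above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a cached document; not the project's real class.
class CachedDocument {
    final String id;
    final Set<String> fetchedBy = new HashSet<>(); // stages that fetched the document
    final Set<String> touchedBy = new HashSet<>(); // stages that reported back
    boolean failed = false;

    CachedDocument(String id) {
        this.id = id;
    }

    // True if at least one stage fetched the document but never touched it.
    boolean isFetchedNotTouched() {
        return !touchedBy.containsAll(fetchedBy);
    }
}

class FlushCheck {
    // Intended to run on every document leaving the cache.
    // Marks fetched-not-touched documents as failed and returns them for logging.
    static List<CachedDocument> failOrphans(List<CachedDocument> leaving) {
        List<CachedDocument> failed = new ArrayList<>();
        for (CachedDocument doc : leaving) {
            if (doc.isFetchedNotTouched()) {
                doc.failed = true;
                failed.add(doc);
            }
        }
        return failed;
    }
}
```

The sketch fails the document instead of resetting its state, matching the point above that the document itself may have caused the crash.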

Only OOM and other JVM-level errors will cause this problem - any uncaught exception fails the document (since 0.5.0).
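The distinction follows from plain Java semantics: `OutOfMemoryError` extends `Error`, not `Exception`, so a `catch (Exception e)` handler never sees it. The sketch below is a generic illustration, not the project's stage code; `CrashDemo`, `process` and `outcome` are hypothetical names, and the OOM is simulated by throwing the error directly.

```java
class CrashDemo {
    // Runs a stage body and reports how the document ends up.
    static String process(Runnable stageBody) {
        try {
            stageBody.run();
            return "touched";
        } catch (Exception e) {
            // Ordinary uncaught exceptions can be converted into a failed document.
            return "failed";
        }
        // An Error such as OutOfMemoryError is not an Exception,
        // so it propagates past the handler above.
    }

    // Stands in for the crashed stage process: nothing marks the document.
    static String outcome(Runnable stageBody) {
        try {
            return process(stageBody);
        } catch (Error e) {
            return "fetched-not-touched";
        }
    }
}
```

In a real crash there is no outer handler at all, which is why the document stays in the fetched state.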

Documents will need to be failed since it can't be known what caused the crash.

Some possible solutions:

  • Check documents on flush() in the cache (implemented in laserval@2d5c587)
    • Only works when cache is used
    • All failures can be logged, but the document should be failed by Core
    • Adds responsibility for checking document state to the cache
  • Add a check for fetched-not-touched on stage startup
    • Requires RemotePipeline API change
    • May require database API change
    • Race condition when multiple stages need to mark the failure
    • Stage has responsibility for setting document state
    • When running without cache with multiple Cores, a stage could mark a document that is being processed on another instance as failed
  • Add a pipeline-level timeout to documents
    • Needs to poll database or Core for fetched-not-touched documents
    • May require database API change
    • Adds a monitor thread
  • Implement stage document subscription (see Subscription model #272) and timeouts for documents in Core
    • Extensive changes to Stage API
    • Gives Core responsibility for document state
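As a rough illustration of the timeout-based options, the sketch below tracks when each document was fetched and reports those that were never touched within a deadline. `FetchTimeoutMonitor` and its methods are hypothetical names; how Core or a monitor thread would learn about fetch and touch events is left open, and timestamps are passed in explicitly to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper; not part of any existing API.
class FetchTimeoutMonitor {
    private final Map<String, Long> fetchedAt = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    FetchTimeoutMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record that a stage fetched the document at the given time.
    void onFetch(String docId, long nowMillis) {
        fetchedAt.put(docId, nowMillis);
    }

    // A touch clears the pending fetch.
    void onTouch(String docId) {
        fetchedAt.remove(docId);
    }

    // Returns (and forgets) documents fetched more than timeoutMillis ago
    // that were never touched. A monitor thread could call this periodically
    // and fail whatever comes back.
    List<String> expire(long nowMillis) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> e : fetchedAt.entrySet()) {
            boolean tooOld = nowMillis - e.getValue() > timeoutMillis;
            // Conditional remove guards against a concurrent touch or re-fetch.
            if (tooOld && fetchedAt.remove(e.getKey(), e.getValue())) {
                expired.add(e.getKey());
            }
        }
        return expired;
    }
}
```

Choosing the timeout is the hard part: too short and slow stages get their documents failed while still working on them.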

A general problem is that a document can only be failed once. When several stages crash, it's not clear which should be the one to mark the document as failed.
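One way to sidestep the "who marks it" question is to make the failure marker first-writer-wins. The sketch below does this in memory with a compare-and-set; `FailOnce` and `markFailed` are hypothetical names, and a setup that keeps state in a database would need an equivalent conditional update there.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical in-memory failure marker for a single document.
class FailOnce {
    // null means "not failed"; otherwise the name of the stage that failed it.
    private final AtomicReference<String> failedBy = new AtomicReference<>(null);

    // Returns true only for the first caller; later callers leave the marker untouched.
    boolean markFailed(String stageName) {
        return failedBy.compareAndSet(null, stageName);
    }

    String failedBy() {
        return failedBy.get();
    }
}
```

With this, every crashed stage can attempt the mark and only the first attempt takes effect, so no coordination between stages is needed.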

@remen
Contributor

remen commented Apr 10, 2014

I think combining #272 with transactional queues is the "go-to" solution for this problem.

This situation should, as you said, be much less common in 0.5.0 however!

@remen
Contributor

remen commented Apr 10, 2014

(sorry for spamming you with stupid solution ideas that you already listed above :D)

laserval added this to the 0.6.0 milestone Apr 11, 2014
@laserval
Contributor

To me it seems the long-term solution is #272 and #307, as you said: stages subscribe to documents using their query, and Core can then keep track of timeouts and other document state variables. Not entirely sure how that would happen though - but the details can be discussed in those issues.


3 participants