A solution for this would be to let Core check documents when they are returned from the cache. Any document that was fetched but not touched when it is returned should be considered a failed document, rather than just having its status reset, since it may be the document's fault that the stage crashed.
Some background:
A problem in pipelines with unstable stages is that documents may be fetched by a stage, which then crashes. This will cause the document to remain in a fetched state with no matching touched state, forever. Related discussion: #307 (unrecoverable stage failure will cause the same issues)
Since Core doesn't keep track of who fetched what document (that information is stored on the document level), there is nothing that checks if this state of fetched-not-touched has occurred.
Core has a chance to check document state in one case: when flushing documents to the backing database (all documents will pass through flush()).
Only OOM and other JVM-level errors will cause this problem - any uncaught exception fails the document (since 0.5.0).
Documents will need to be failed since it can't be known what caused the crash.
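As a rough illustration of the flush-time check described above, here is a minimal Java sketch. The names `FlushCheckSketch`, `Doc`, `fetchedBy`, `touchedBy` and the shape of `flush` are hypothetical stand-ins, not the actual Hydra cache or document API, and it assumes a document is only flushed once no stage is still legitimately working on it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these types are the real Hydra classes.
class FlushCheckSketch {

    // Stand-in for a cached document that records which stages fetched/touched it.
    static class Doc {
        final String id;
        final Set<String> fetchedBy = new HashSet<>();
        final Set<String> touchedBy = new HashSet<>();
        boolean failed = false;

        Doc(String id) {
            this.id = id;
        }

        // True if at least one stage fetched the document without touching it.
        boolean isFetchedNotTouched() {
            return !touchedBy.containsAll(fetchedBy);
        }
    }

    // Marks every fetched-not-touched document as failed and returns those documents.
    // Assumption: a real flush() would go on to write the documents to the database.
    static List<Doc> flush(List<Doc> cached) {
        List<Doc> newlyFailed = new ArrayList<>();
        for (Doc doc : cached) {
            if (!doc.failed && doc.isFetchedNotTouched()) {
                doc.failed = true; // fail rather than reset: the document may have caused the crash
                newlyFailed.add(doc);
            }
        }
        return newlyFailed;
    }
}
```

The document is failed instead of having its fetched state reset, which matches the reasoning above: the crash cause is unknown, so the document itself cannot be ruled out.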
Some possible solutions:
- Check documents on flush() in the cache (implemented in laserval@2d5c587)
  - Only works when cache is used
  - All failures can be logged, but the document should be failed by Core
  - Adds responsibility for checking document state to the cache
- Add a check for fetched-not-touched on stage startup
  - Requires RemotePipeline API change
  - May require database API change
  - Race condition when multiple stages need to mark the failure
  - Stage has responsibility for setting document state
  - When running without cache with multiple Cores, a stage could mark a document that is being processed on another instance as failed
- Add a pipeline-level timeout to documents
  - Needs to poll database or Core for fetched-not-touched documents
  - May require database API change
  - Adds a monitor thread
- Implement stage document subscription (see Subscription model #272) and timeouts for documents in Core
  - Extensive changes to Stage API
  - Gives Core responsibility for document state
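
To make the timeout-based options more concrete, here is a small Java sketch of the scan a monitor thread could run on each poll. `TimeoutScanSketch`, `Fetch` and `findTimedOut` are hypothetical names, and it assumes that the fetch time and the fetching stage are available per document, which the current database API may not expose.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not part of the Hydra API.
class TimeoutScanSketch {

    // One outstanding fetch: which stage fetched which document, and when.
    static class Fetch {
        final String docId;
        final String stage;
        final long fetchedAtMillis;
        final boolean touched;

        Fetch(String docId, String stage, long fetchedAtMillis, boolean touched) {
            this.docId = docId;
            this.stage = stage;
            this.fetchedAtMillis = fetchedAtMillis;
            this.touched = touched;
        }
    }

    // Returns the ids of documents that were fetched but not touched within the timeout.
    static List<String> findTimedOut(List<Fetch> fetches, long nowMillis, long timeoutMillis) {
        List<String> timedOut = new ArrayList<>();
        for (Fetch f : fetches) {
            boolean expired = nowMillis - f.fetchedAtMillis > timeoutMillis;
            if (!f.touched && expired && !timedOut.contains(f.docId)) {
                timedOut.add(f.docId);
            }
        }
        return timedOut;
    }
}
```

A monitor could call `findTimedOut` periodically and fail the returned documents. Choosing the timeout is the hard part: it has to be longer than the slowest legitimate stage, or healthy documents will be failed.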
A general problem is that a document can only be failed once. When several stages crash, it's not clear which should be the one to mark the document as failed.
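One way to sidestep the question of which stage should mark the document is to make the transition to failed atomic, so the first marker wins and later attempts become no-ops. The Java sketch below shows the idea with an in-process compare-and-set; `FailOnceSketch`, `Doc`, `markFailed` and `getFailedBy` are hypothetical names. With multiple Cores, the same effect would presumably need a conditional update in the backing database rather than an in-memory reference.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; not part of the Hydra API.
class FailOnceSketch {

    static class Doc {
        // null while the document is not failed; otherwise the name of the stage blamed.
        private final AtomicReference<String> failedBy = new AtomicReference<>(null);

        // Returns true only for the first caller; later callers leave the state unchanged.
        boolean markFailed(String stageName) {
            return failedBy.compareAndSet(null, stageName);
        }

        String getFailedBy() {
            return failedBy.get();
        }
    }
}
```

For example, if `stageA` and `stageB` both call `markFailed` on the same document, only the first call returns `true`, and `getFailedBy()` keeps reporting that first stage.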
To me it seems the long-term solution is #272 and #307, as you said: stages subscribe to documents using their query, and Core can then keep track of timeouts and other document state variables. Not entirely sure how that would happen though - but the details can be discussed in those issues.