The document that was processed by a stage when it crashed is left unhandled in the pipeline #328

ssimon · 2014-04-10T12:47:35Z

A solution for this would be to let Core check documents when they are returned from the cache. Any document that was fetched but not touched when returned should be considered a failed document (not just have its status reset, since it may be the document's fault that the stage crashed).

laserval · 2014-04-10T15:28:40Z

Some background:
A problem in pipelines with unstable stages is that documents may be fetched by a stage, which then crashes. This will cause the document to remain in a fetched state with no matching touched state, forever. Related discussion: #307 (unrecoverable stage failure will cause the same issues)

Since Core doesn't keep track of who fetched what document (that information is stored on the document level), there is nothing that checks if this state of fetched-not-touched has occurred.
Core has a chance to check document state in one case: when flushing documents to the backing database (all documents will pass through flush()).

Only OOM and other JVM-level errors will cause this problem - any uncaught exceptions fail the document (since 0.5.0).

Documents will need to be failed since it can't be known what caused the crash.
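
To make the fetched-not-touched check concrete, here is a minimal sketch of what a check at flush time might look like. The names `DocState`, `FlushCheck`, `fetchedNotTouched` and `checkOnFlush` are hypothetical and are not Hydra's actual classes or API. The sketch only assumes that each document records which stages fetched and touched it, and that flush() only sees documents that no stage is still working on.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a cached document; not Hydra's real document class.
class DocState {
    final Set<String> fetchedBy = new HashSet<>();
    final Set<String> touchedBy = new HashSet<>();
    boolean failed = false;
}

class FlushCheck {
    // Names of the stages that fetched the document but never touched it.
    static Set<String> fetchedNotTouched(DocState doc) {
        Set<String> pending = new HashSet<>(doc.fetchedBy);
        pending.removeAll(doc.touchedBy);
        return pending;
    }

    // Intended to run for every document on its way to the backing database.
    // Returns true if this call marked the document as failed.
    static boolean checkOnFlush(DocState doc) {
        if (doc.failed || fetchedNotTouched(doc).isEmpty()) {
            return false;
        }
        // Fail instead of resetting the status: the document itself
        // may have been what crashed the stage.
        doc.failed = true;
        return true;
    }
}
```

A document fetched by stages "a" and "b" but touched only by "a" would be failed by `checkOnFlush`, while a document where every fetch has a matching touch passes through unchanged.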

Some possible solutions:

1. Check documents on flush() in the cache (implemented in laserval@2d5c587)
   - Only works when the cache is used
   - All failures can be logged, but the document should be failed by Core
   - Adds responsibility for checking document state to the cache
2. Add a check for fetched-not-touched on stage startup
   - Requires RemotePipeline API change
   - May require database API change
   - Race condition when multiple stages need to mark the failure
   - Stage has responsibility for setting document state
   - When running without cache with multiple Cores, a stage could mark a document that is being processed on another instance as failed
3. Add a pipeline-level timeout to documents
   - Needs to poll database or Core for fetched-not-touched documents
   - May require database API change
   - Adds a monitor thread
4. Implement stage document subscription (see Subscription model #272) and timeouts for documents in Core
   - Extensive changes to Stage API
   - Gives Core responsibility for document state
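
As an illustration of the pipeline-level timeout option, the core of such a monitor could be a scan like the one below. This is only a sketch: `TimeoutScan`, `expired` and the shape of the input map are assumptions made for the example, and how the fetched-not-touched documents and their fetch times would actually be obtained from the database or Core is left open, as in the list above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TimeoutScan {
    // fetchedAtMillis is a hypothetical view of the documents that are currently
    // fetched but not touched: document id -> time of the fetch in epoch millis.
    // Returns the ids whose fetch is older than timeoutMillis.
    static List<String> expired(Map<String, Long> fetchedAtMillis,
                                long nowMillis, long timeoutMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : fetchedAtMillis.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

The monitor thread mentioned above could call `expired` periodically, for example from a `ScheduledExecutorService`, and fail whatever it returns. Choosing the timeout is the hard part, since a slow but healthy stage looks the same as a crashed one until it touches the document.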

A general problem is that a document can only be failed once. When several stages crash, it's not clear which should be the one to mark the document as failed.
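
One possible policy for this, not something decided in this thread, is "first marker wins". The sketch below shows that idea inside a single process using `ConcurrentHashMap.putIfAbsent`. `FailureRegistry`, `markFailed` and `blamedStage` are hypothetical names, and across several Cores the same effect would need a conditional update in the database instead.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class FailureRegistry {
    // document id -> name of the stage recorded as the cause of the failure
    private final ConcurrentMap<String, String> failures = new ConcurrentHashMap<>();

    // Returns true only for the first caller per document.
    // Later callers lose the race and leave the recorded stage unchanged.
    boolean markFailed(String docId, String stage) {
        return failures.putIfAbsent(docId, stage) == null;
    }

    // The stage that won the race, or null if the document has not been failed.
    String blamedStage(String docId) {
        return failures.get(docId);
    }
}
```

This makes the outcome deterministic per document, but it does not answer which stage is the right one to blame. It only guarantees that the document is failed exactly once.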

remen · 2014-04-10T15:33:44Z

I think combining #272 with transactional queues is the "go-to" solution for this problem

This situation should, as you said, be much less common in 0.5.0 however!

remen · 2014-04-10T15:37:10Z

(sorry for spamming you with stupid solution ideas that you already listed above :D)

laserval · 2014-04-11T06:59:27Z

To me it seems the long-term solution is #272 and #307, as you said: stages subscribe to documents using their query, and Core can then keep track of timeouts and other document state variables. Not entirely sure how that would happen though - but the details can be discussed in those issues.

laserval added core labels Apr 10, 2014

laserval added this to the 0.6.0 milestone Apr 11, 2014
