
Method for signaling unrecoverable Stage failure #307

Open
laserval opened this issue Feb 3, 2014 · 3 comments

Comments

@laserval
Contributor

laserval commented Feb 3, 2014

Currently, Stages can signal at startup that the parameters set in their configuration don't work. After a few tries, Core will stop starting the stage and log the failure.

The same can't be done when the Stage is already running; the only things a Stage can do when things go wrong are:

  • Document failure: Fail the document
  • Stage failure: Throw a RuntimeException or some other unhandled Error or Exception
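
To make the two options concrete, here is a minimal sketch of a stage's processing method. The class and method names (Stage, Document, process(), fail()) are illustrative stand-ins, not the actual Hydra API.

```java
import java.io.IOException;

// Illustrative only: Stage, Document and externalLookup() are hypothetical
// stand-ins rather than the actual Hydra API.
public class LookupStage extends Stage {

    @Override
    public void process(Document doc) {
        try {
            doc.setField("enriched", externalLookup(doc.getField("id")));
        } catch (IOException e) {
            // Option 1, document failure: mark this document as failed and
            // let the pipeline continue with the next one.
            doc.fail("Lookup failed: " + e.getMessage());
        }
        // Option 2, stage failure: anything not caught here (a RuntimeException,
        // an Error such as OutOfMemoryError, ...) propagates out of process(),
        // the stage JVM dies, and Core restarts it.
    }

    // Placeholder for a call to an external service (database, web API, ...).
    private String externalLookup(Object id) throws IOException {
        return "value-for-" + id;
    }
}
```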

With #305, this becomes clearer by ensuring all Exceptions are caught, while all Errors cause stage failure. Stage failure will make Core restart the stage JVM, which might or might not make the stage recover and continue processing. In some cases, the failure is due to an external service (e.g. a database or web API), and recovery is impossible for the foreseeable future. This will cause Core to restart the stage over and over, while leaving fetched-but-not-touched documents in the pipeline.

Since the current stage failure signal is that the stage JVM shuts down, we would need to introduce some other way of signaling the Core about stage failures.
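
One conceivable shape for such a signal is a dedicated exception type that Core treats differently from an ordinary crash. This is purely a sketch; the exception name and the suggested Core behaviour are assumptions, not anything that exists in Hydra.

```java
// Hypothetical: PermanentStageFailureException does not exist in Hydra.
public class PermanentStageFailureException extends RuntimeException {
    public PermanentStageFailureException(String message, Throwable cause) {
        super(message, cause);
    }
}

// A stage could throw it when it knows a restart will not help, e.g.:
//
//     if (!externalService.isReachable()) {
//         throw new PermanentStageFailureException("search API unreachable", lastError);
//     }
//
// and Core could react by stopping the stage and alerting an operator, instead of
// restarting the JVM in a loop while fetched documents sit untouched in the pipeline.
```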

@remen
Contributor

remen commented Feb 3, 2014

This is how I interpret what you are saying:

Usually, when a stage tries to access some unavailable external resource, it will throw an Exception, not an Error (Errors are more likely things like OutOfMemoryError). Hence, with the changes in #305, if an external resource is unavailable, all documents passing through the stage will be marked as failed without triggering a stage restart.

With some surrounding infrastructure and UX design (monitoring support?), it might be a good idea to let a stage communicate that it is in a "temporarily unavailable" state, at which point the Core will let the documents remain on the queue until such time that the stage is "available" again.

Did I understand you correctly?

@laserval
Contributor Author

laserval commented Feb 4, 2014

Even without the changes, stages that throw unhandled exceptions (outside init()) will simply cause restarts, which means a new document will be fetched and may cause yet another stage restart, continuing forever. This leaves documents in a fetched-but-not-touched state that Hydra can't recover from without manual intervention.

The changes make this less likely by instead failing the documents on exceptions, which at least keeps the documents in a useful state (and gives external systems something to react on).

"Temporarily unavailable" sounds nice: some way for stages to tell the Core their current status and give some indication of why they are unable to process documents. That way the stage can recover on its own, instead of requiring a Core restart or something like that.
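
To make that idea concrete, here is a minimal sketch of what such a status contract could look like; the type names and the notion that Core pauses fetching for an unavailable stage are assumptions, not anything Hydra provides today.

```java
// Hypothetical API sketch -- none of these types exist in Hydra.
public enum StageStatus {
    AVAILABLE,               // processing normally
    TEMPORARILY_UNAVAILABLE, // e.g. external service down; Core leaves documents queued
    FAILED                   // unrecoverable; Core stops the stage and alerts an operator
}

public interface StatusReporter {
    /** Called by a stage to tell Core its current state and why. */
    void report(StageStatus status, String reason);
}
```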

@remen
Contributor

remen commented Feb 4, 2014

Ah, yes. I see. Messages in transit will be in an inconsistent state when a processor restarts due to a failure condition. This also applies if there is a network split (i.e., the processor cannot communicate with the core).

A potential solution would be to use some sort of transactional queue system. That way, any message that fails due to a lost connection (e.g. a network split or a JVM restart) will be rolled back and, depending on settings, either retransmitted (a configurable number of times) or put on a "dead letter" queue.
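
For illustration only, this is roughly how that pattern looks with plain JMS. Hydra's pipeline is not JMS-based, and the queue name and processDocument() call are made up, but the transacted receive with commit/rollback is the mechanism described above.

```java
import javax.jms.*;

public class TransactionalConsumer {

    public void consume(ConnectionFactory factory) throws JMSException {
        Connection connection = factory.createConnection();
        try {
            connection.start();
            // Transacted session: a received message stays on the queue until commit().
            Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
            MessageConsumer consumer = session.createConsumer(session.createQueue("documents"));

            while (true) {
                Message message = consumer.receive();
                try {
                    processDocument(message);  // hypothetical processing step
                    session.commit();          // only now is the message removed from the queue
                } catch (Exception e) {
                    // Rolled back: the broker redelivers the message up to a configured
                    // limit, after which it goes to a dead-letter queue.
                    session.rollback();
                }
            }
        } finally {
            connection.close();
        }
    }

    private void processDocument(Message message) throws Exception {
        // Placeholder for the actual stage work.
    }
}
```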
