
Method for signaling unrecoverable Stage failure #307

Open
laserval opened this issue Feb 3, 2014 · 3 comments

Comments

@laserval
Contributor

laserval commented Feb 3, 2014

Currently, Stages can signal at startup that the parameters set in their configuration don't work. After a few tries, Core will stop starting the stage and log the failure.

The same can't be done when the Stage is already running; the only things a Stage can do when things go wrong are:

  • Document failure: Fail the document
  • Stage failure: Throw a RuntimeException or some other unhandled Error or Exception
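
To make the two options concrete, here is a minimal sketch of a stage's processing method. The class and method names (Stage, Document, process(), fail()) are illustrative stand-ins, not the actual Hydra API.

```java
import java.io.IOException;

// Illustrative only: Stage, Document and externalLookup() are hypothetical
// stand-ins rather than the actual Hydra API.
public class LookupStage extends Stage {

    @Override
    public void process(Document doc) {
        try {
            doc.setField("enriched", externalLookup(doc.getField("id")));
        } catch (IOException e) {
            // Option 1, document failure: mark this document as failed and
            // let the pipeline continue with the next one.
            doc.fail("Lookup failed: " + e.getMessage());
        }
        // Option 2, stage failure: anything not caught here (a RuntimeException,
        // an Error such as OutOfMemoryError, ...) propagates out of process(),
        // the stage JVM dies, and Core restarts it.
    }

    // Placeholder for a call to an external service (database, web API, ...).
    private String externalLookup(Object id) throws IOException {
        return "value-for-" + id;
    }
}
```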

With #305, this becomes clearer by ensuring all Exceptions are caught, while all Errors cause stage failure. Stage failure will make Core restart the stage JVM, which might or might not make the stage recover and continue processing. In some cases, the failure is due to an external service (e.g. a database or web API), and recovery is impossible for the foreseeable future. This will cause Core to restart the stage over and over, while leaving fetched-but-not-touched documents in the pipeline.

Since the current stage failure signal is that the stage JVM shuts down, we would need to introduce some other way of signaling the Core about stage failures.
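
One conceivable shape for such a signal is a dedicated exception type that Core treats differently from an ordinary crash. This is purely a sketch; the exception name and the suggested Core behaviour are assumptions, not anything that exists in Hydra.

```java
// Hypothetical: PermanentStageFailureException does not exist in Hydra.
public class PermanentStageFailureException extends RuntimeException {
    public PermanentStageFailureException(String message, Throwable cause) {
        super(message, cause);
    }
}

// A stage could throw it when it knows a restart will not help, e.g.:
//
//     if (!externalService.isReachable()) {
//         throw new PermanentStageFailureException("search API unreachable", lastError);
//     }
//
// and Core could react by stopping the stage and alerting an operator, instead of
// restarting the JVM in a loop while fetched documents sit untouched in the pipeline.
```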

@remen
Contributor

remen commented Feb 3, 2014

This is how I interpret what you are saying:

Usually, when a stage tries to access some unavailable external resource, it will throw an Exception, not an Error (Errors are more likely things like OutOfMemoryError). Hence, with the changes in #305, if an external resource is unavailable, all documents passing through the stage will be marked as failed without triggering a stage restart.

With some surrounding infrastructure and UX design (monitoring support?), it might be a good idea to let a stage communicate that it is in a "temporarily unavailable" state, at which point the Core will let the documents remain on the queue until such time that the stage is "available" again.

Did I understand you correctly?

@laserval
Contributor Author

laserval commented Feb 4, 2014

Even without the changes, stages that throw unhandled exceptions (outside init()) will simply cause restarts, which means a new document will be fetched and may cause yet another stage restart, continuing forever. This leaves documents in a fetched-but-not-touched state that Hydra can't recover from without manual intervention.

The changes make this less likely by instead failing the documents on exceptions, which at least keeps the documents in a useful state (and gives external systems something to react on).

"Temporarily unavailable" sounds nice: some way for stages to tell the Core their current status and give some indication of why they are unable to process documents. That way the stage can recover on its own, instead of requiring a Core restart or something like that.
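
To make that idea concrete, here is a minimal sketch of what such a status contract could look like; the type names and the notion that Core pauses fetching for an unavailable stage are assumptions, not anything Hydra provides today.

```java
// Hypothetical API sketch -- none of these types exist in Hydra.
public enum StageStatus {
    AVAILABLE,               // processing normally
    TEMPORARILY_UNAVAILABLE, // e.g. external service down; Core leaves documents queued
    FAILED                   // unrecoverable; Core stops the stage and alerts an operator
}

public interface StatusReporter {
    /** Called by a stage to tell Core its current state and why. */
    void report(StageStatus status, String reason);
}
```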

@remen
Contributor

remen commented Feb 4, 2014

Ah, yes. I see. Messages in transit will be in an inconsistent state when a processor restarts due to a failure condition. This also applies if there is a network split (i.e., the processor cannot communicate with the core).

A potential solution would be to use some sort of transactional queue system. That way, any message that fails due to a lost connection (e.g. a network split or a JVM restart) will be rolled back and, depending on settings, either retransmitted (a configurable number of times) or put on a "dead letter" queue.
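
For illustration only, this is roughly how that pattern looks with plain JMS. Hydra's pipeline is not JMS-based, and the queue name and processDocument() call are made up, but the transacted receive with commit/rollback is the mechanism described above.

```java
import javax.jms.*;

public class TransactionalConsumer {

    public void consume(ConnectionFactory factory) throws JMSException {
        Connection connection = factory.createConnection();
        try {
            connection.start();
            // Transacted session: a received message stays on the queue until commit().
            Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
            MessageConsumer consumer = session.createConsumer(session.createQueue("documents"));

            while (true) {
                Message message = consumer.receive();
                try {
                    processDocument(message);  // hypothetical processing step
                    session.commit();          // only now is the message removed from the queue
                } catch (Exception e) {
                    // Rolled back: the broker redelivers the message up to a configured
                    // limit, after which it goes to a dead-letter queue.
                    session.rollback();
                }
            }
        } finally {
            connection.close();
        }
    }

    private void processDocument(Message message) throws Exception {
        // Placeholder for the actual stage work.
    }
}
```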
