Method for signaling unrecoverable Stage failure #307
Comments
This is how I interpret what you are saying: Usually, when a stage tries to access some unavailable external resource, it will throw an exception. With some surrounding infrastructure and UX design (monitoring support?), it might be a good idea to let a stage communicate that it is in a "temporarily unavailable" state, at which point the core will let the documents remain on the queue until the stage is "available" again. Did I understand you correctly?
Even without the changes, stages that throw unhandled exceptions (outside […] The changes make this less likely by instead failing the documents on exceptions, which at least keeps the documents in a useful state (and gives external systems something to react to).

Temporarily unavailable sounds nice: some way for stages to tell the Core their current status and give some indication of why they are unable to process documents. That way the stage can recover on its own, instead of requiring a Core restart or something like that.
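To make the "fail the documents on exceptions" idea concrete, here is a minimal sketch of a processing loop, assuming documents are plain strings and the stage logic is a `Consumer`. The class `FailOnExceptionLoop` and its fields are hypothetical names for illustration only and are not taken from the project's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch: an Exception fails one document, an Error is left to propagate. */
final class FailOnExceptionLoop {
    final List<String> processed = new ArrayList<>();
    final List<String> failed = new ArrayList<>();

    void run(List<String> documents, Consumer<String> stageLogic) {
        for (String doc : documents) {
            try {
                stageLogic.accept(doc);
                processed.add(doc);
            } catch (Exception e) {
                // Only the current document is marked as failed;
                // the loop continues with the next one.
                failed.add(doc);
            }
            // java.lang.Error is deliberately not caught here, so it
            // escapes run() and can take the whole stage down.
        }
    }
}
```

With this shape, a document that triggers an exception ends up in `failed` rather than being left untouched, which gives external systems something to react to.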
Ah, yes. I see. Messages in transit will be in an inconsistent state when a processor restarts due to a failure condition. This also applies if there is a network split (i.e., the processor cannot communicate with the core). A potential solution would be to use some sort of transactional queue system. That way, any message that fails due to a lost connection (e.g. a network split or a JVM restart) will be rolled back and, depending on settings, either retransmitted (a configurable number of times) or put on a "DEAD LETTER" queue.
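The redelivery and dead-letter behaviour described above could look roughly like the following in-memory sketch. It is only an illustration: `RetryingQueue` is a hypothetical class, a failed delivery is modelled as a thrown `RuntimeException` (a real JVM crash or network split would need a broker-side timeout or transaction instead), and nothing here reflects the project's actual queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical in-memory queue with bounded redelivery and a dead-letter list. */
final class RetryingQueue<T> {
    private static final class Entry<T> {
        final T payload;
        int attempts;
        Entry(T payload) { this.payload = payload; }
    }

    private final Deque<Entry<T>> pending = new ArrayDeque<>();
    private final List<T> deadLetters = new ArrayList<>();
    private final int maxAttempts;

    RetryingQueue(int maxAttempts) { this.maxAttempts = maxAttempts; }

    void enqueue(T payload) { pending.addLast(new Entry<>(payload)); }

    /** Delivers one message; returns false if nothing was pending. */
    boolean processNext(Consumer<T> processor) {
        Entry<T> entry = pending.pollFirst();
        if (entry == null) {
            return false;
        }
        entry.attempts++;
        try {
            processor.accept(entry.payload); // success: the message is consumed
        } catch (RuntimeException e) {
            if (entry.attempts < maxAttempts) {
                pending.addLast(entry);      // "rollback": schedule a redelivery
            } else {
                deadLetters.add(entry.payload); // attempts exhausted
            }
        }
        return true;
    }

    int pendingCount() { return pending.size(); }

    List<T> deadLetters() { return deadLetters; }
}
```

The configurable part is `maxAttempts`: with a value of 2, a message that fails twice is moved to the dead-letter list instead of being retried again.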
Currently, Stages can signal on startup that the parameters set in their configuration don't work. After a few tries, Core will stop trying to start the stage and log that.
The same can't be done when the Stage is already running; the only things a Stage can do when things go wrong are:
With #305, this becomes clearer by ensuring all `Exception`s are caught, while all `Error`s cause stage failure. Stage failure will make Core restart the stage JVM, which might or might not make the stage recover and continue processing. In some cases, the failure is due to an external service (e.g. a database or web API), and recovery is impossible for the foreseeable future. This will cause Core to restart the stage over and over, while leaving fetched-but-not-touched documents in the pipeline.

Since the current stage failure signal is that the stage JVM shuts down, we would need to introduce some other way of signaling the Core about stage failures.
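One possible shape for such a signal, sketched under loud assumptions: the stage reports an explicit status and the Core maps it to an action instead of inferring failure from a JVM exit. `StageStatus`, `StageSupervisor`, `Action` and the `maxUnavailableReports` limit are all hypothetical names invented for this illustration; none of them exist in the project.

```java
/** Hypothetical status a stage could report to the Core. */
enum StageStatus { AVAILABLE, TEMPORARILY_UNAVAILABLE, FAILED }

/** Hypothetical Core-side policy that turns status reports into actions. */
final class StageSupervisor {
    enum Action { DISPATCH, HOLD_DOCUMENTS, STOP_STAGE }

    private final int maxUnavailableReports;
    private int consecutiveUnavailable = 0;

    StageSupervisor(int maxUnavailableReports) {
        this.maxUnavailableReports = maxUnavailableReports;
    }

    Action onStatus(StageStatus status) {
        switch (status) {
            case AVAILABLE:
                // The stage recovered on its own: resume handing out documents.
                consecutiveUnavailable = 0;
                return Action.DISPATCH;
            case TEMPORARILY_UNAVAILABLE:
                // Leave documents on the queue, but give up after too many
                // consecutive reports (mirrors the "a few tries" startup rule).
                consecutiveUnavailable++;
                return consecutiveUnavailable > maxUnavailableReports
                        ? Action.STOP_STAGE
                        : Action.HOLD_DOCUMENTS;
            default:
                // FAILED: unrecoverable, so do not restart the stage in a loop.
                return Action.STOP_STAGE;
        }
    }
}
```

The point of the design is that an unrecoverable failure becomes an explicit `STOP_STAGE` decision, so Core does not keep restarting a stage whose external dependency is gone.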