
Investigate Bookkeeper disruption cases #114

Closed · adrianmo opened this issue Dec 20, 2018 · 3 comments

Assignees: pbelgundi
Labels: DR (Disaster Recovery), kind/enhancement (Enhancement of an existing feature), priority/P2 (Slight inconvenience or annoyance to applications, system continues to function), status/needs-investigation (Further investigation is required)

Comments

@adrianmo (Contributor) commented Dec 20, 2018

We need to investigate and implement an action plan to cover all BookKeeper disruption cases. Some disruption cases that come to mind are:

  • Graceful termination. The pod receives a TERM signal and has a chance to shut down gracefully. Situations that cause this include:
    • kubectl drain to remove a node from the Kubernetes cluster.
    • kubectl delete pod to delete a particular Bookkeeper pod, possibly by accident.
    • A scale-down event.
  • Unexpected termination. These disruptions give pods no chance to terminate gracefully. Examples are hardware errors, VM deletion, hard evictions, etc.

For graceful terminations, we may want to run a pre-delete hook (a preStop handler in Kubernetes terminology) to make sure that the ledgers stored on that Bookie are re-replicated before it is shut down, probably by running BookKeeper's manual recovery process. A sketch follows.
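
A minimal sketch of what such a preStop handler could look like, built with the k8s.io/api Go types. The BookKeeper binary path, the bookie port 3181, and deriving the bookie ID from the pod hostname are assumptions for illustration:

```go
package bookie

import (
	corev1 "k8s.io/api/core/v1"
)

// bookieLifecycle builds a container lifecycle whose preStop handler runs
// BookKeeper's manual recovery ("bookkeeper shell recover") for this bookie,
// blocking shutdown until the bookie's ledgers have been re-replicated or
// the pod's terminationGracePeriodSeconds expires.
// Note: corev1.Handler was renamed to corev1.LifecycleHandler in newer
// k8s.io/api versions.
func bookieLifecycle() *corev1.Lifecycle {
	return &corev1.Lifecycle{
		PreStop: &corev1.Handler{
			Exec: &corev1.ExecAction{
				Command: []string{
					"/bin/sh", "-c",
					// Binary path and hostname-based bookie ID are
					// illustrative assumptions.
					"/opt/bookkeeper/bin/bookkeeper shell recover $(hostname -f):3181",
				},
			},
		},
	}
}
```

The pod's terminationGracePeriodSeconds would need to be long enough for re-replication to finish; otherwise the kubelet kills the container mid-recovery.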

For unexpected terminations, we may want to rely on BookKeeper's autorecovery feature, plus a pod disruption budget to prevent a second graceful pod termination until the terminated pod is rescheduled and recovered. See the sketch below.
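
On the disruption-budget side, here is a sketch of a PodDisruptionBudget that allows at most one Bookie to be voluntarily disrupted at a time (the object name and label selector are illustrative assumptions):

```go
package bookie

import (
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// bookiePDB limits voluntary disruptions (drains, evictions through the
// eviction API) to one bookie at a time, so a second graceful termination
// is blocked until the disrupted bookie is rescheduled and healthy again.
func bookiePDB(namespace string) *policyv1beta1.PodDisruptionBudget {
	maxUnavailable := intstr.FromInt(1)
	return &policyv1beta1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "bookie-pdb", // illustrative name
			Namespace: namespace,
		},
		Spec: policyv1beta1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector: &metav1.LabelSelector{
				// Illustrative selector; the operator's real pod labels
				// may differ.
				MatchLabels: map[string]string{"app": "bookkeeper"},
			},
		},
	}
}
```

Note that a PDB guards evictions (kubectl drain and the eviction API), not a direct kubectl delete pod, so the preStop hook above remains the safety net for manual deletions.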

@adrianmo added the kind/enhancement (Enhancement of an existing feature), priority/P1 (Recoverable error, functionality/performance impaired but not lost, no permanent damage) and status/needs-investigation (Further investigation is required) labels on Dec 20, 2018
@adrianmo added the priority/P2 (Slight inconvenience or annoyance to applications, system continues to function) label and removed the priority/P1 label on May 30, 2019
@adrianmo (Contributor, Author)

Changing priority after discussion with Prajakta and Srishti.

@pbelgundi pbelgundi self-assigned this Sep 18, 2019
@pbelgundi pbelgundi added the DR Disaster Recovery label Sep 18, 2019
@pbelgundi (Contributor) commented Sep 18, 2019

Data issues caused by a pod being unavailable for more than a certain period should be handled by AutoRecovery.
Currently, BookKeeper autorecovery is enabled by the operator, but its execution frequency, triggers, and efficiency are governed by configurable properties that need to be tuned to decide how long we want to tolerate potential data loss before kick-starting the recovery process.
Note that the recovery process has its own execution overhead, so we may want to strike a balance between data-loss tolerance and the performance overhead the recovery process adds.
This could also vary from use case to use case and hence should be configurable.

For details of the AutoRecovery process, see: https://bookkeeper.apache.org/docs/4.7.3/admin/autorecovery/#autorecovery

Some configuration properties that impact autorecovery (a tuning sketch follows the list):

  1. auditorPeriodicBookieCheckInterval - The time interval between auditor bookie checks, in seconds. The auditor bookie check inspects ledger metadata to see which bookies should contain entries for each ledger; if a bookie that should contain entries is unavailable, the ledgers containing those entries are marked for recovery. Defaults to once per day.
  2. auditorPeriodicCheckInterval - The time interval, in seconds, at which the auditor checks all ledgers in the cluster. Defaults to once per week.
  3. lostBookieRecoveryDelay - How long to wait, in seconds, before starting autorecovery of a lost bookie. Defaults to 0.
  4. underreplicatedLedgerRecoveryGracePeriod - The grace period (in seconds) for under-replicated ledger recovery. If a ledger is marked under-replicated for more than this period, it is reported by the placementPolicyCheck in the Auditor. Setting this to 0 disables the check. Defaults to 0.
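
As a sketch of how these might be tuned, here is an illustrative set of values expressed as the kind of key/value options an operator could pass through to the bookie configuration. The keys are real BookKeeper server properties; the values are examples only, not recommendations:

```go
package bookie

// autoRecoveryTuning returns illustrative values for the autorecovery
// properties listed above. Values are examples, not recommendations, and
// should be chosen per use case to balance data-loss tolerance against
// the overhead of re-replication.
func autoRecoveryTuning() map[string]string {
	return map[string]string{
		// Check ledger metadata for unavailable bookies hourly rather than daily.
		"auditorPeriodicBookieCheckInterval": "3600",
		// Full scan of all ledgers daily rather than weekly.
		"auditorPeriodicCheckInterval": "86400",
		// Wait five minutes before recovering a lost bookie, to ride out
		// short-lived pod restarts without triggering re-replication.
		"lostBookieRecoveryDelay": "300",
		// Flag ledgers that stay under-replicated for more than 30 minutes.
		"underreplicatedLedgerRecoveryGracePeriod": "1800",
	}
}
```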

@pbelgundi (Contributor)

Moved to Bookkeeper-Operator: pravega/bookkeeper-operator#39
