
Investigate Bookkeeper disruption cases #114

Closed · adrianmo opened this issue Dec 20, 2018 · 3 comments

Assignees: pbelgundi
Labels: DR (Disaster Recovery), kind/enhancement (Enhancement of an existing feature), priority/P2 (Slight inconvenience or annoyance to applications, system continues to function), status/needs-investigation (Further investigation is required)

Comments

@adrianmo (Contributor) commented Dec 20, 2018

We need to investigate and implement an action plan to cover all BookKeeper disruption cases. Some disruption cases that come to mind are:

  • Graceful termination. The pod receives a TERM signal and has a chance to shut down gracefully. Situations that cause this include:
    • kubectl drain to remove a node from the Kubernetes cluster.
    • kubectl delete pod to delete a particular Bookkeeper pod, possibly by accident.
    • A scale-down event.
  • Unexpected termination. These disruptions give pods no chance to terminate gracefully. Examples are hardware errors, VM deletion, hard evictions, etc.

For graceful terminations, we may want to run a pre-delete hook (a preStop handler in Kubernetes terminology) to make sure that the ledgers stored on that Bookie are re-replicated before it is shut down, probably by running BookKeeper's manual recovery process. A sketch follows.
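
A minimal sketch of what such a preStop handler could look like, built with the k8s.io/api Go types. The BookKeeper binary path, the bookie port 3181, and deriving the bookie ID from the pod hostname are assumptions for illustration:

```go
package bookie

import (
	corev1 "k8s.io/api/core/v1"
)

// bookieLifecycle builds a container lifecycle whose preStop handler runs
// BookKeeper's manual recovery ("bookkeeper shell recover") for this bookie,
// blocking shutdown until the bookie's ledgers have been re-replicated or
// the pod's terminationGracePeriodSeconds expires.
// Note: corev1.Handler was renamed to corev1.LifecycleHandler in newer
// k8s.io/api versions.
func bookieLifecycle() *corev1.Lifecycle {
	return &corev1.Lifecycle{
		PreStop: &corev1.Handler{
			Exec: &corev1.ExecAction{
				Command: []string{
					"/bin/sh", "-c",
					// Binary path and hostname-based bookie ID are
					// illustrative assumptions.
					"/opt/bookkeeper/bin/bookkeeper shell recover $(hostname -f):3181",
				},
			},
		},
	}
}
```

The pod's terminationGracePeriodSeconds would need to be long enough for re-replication to finish; otherwise the kubelet kills the container mid-recovery.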

For unexpected terminations, we may want to rely on BookKeeper's autorecovery feature, plus a pod disruption budget to prevent a second graceful pod termination until the terminated pod is rescheduled and recovered. See the sketch below.
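
On the disruption-budget side, here is a sketch of a PodDisruptionBudget that allows at most one Bookie to be voluntarily disrupted at a time (the object name and label selector are illustrative assumptions):

```go
package bookie

import (
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// bookiePDB limits voluntary disruptions (drains, evictions through the
// eviction API) to one bookie at a time, so a second graceful termination
// is blocked until the disrupted bookie is rescheduled and healthy again.
func bookiePDB(namespace string) *policyv1beta1.PodDisruptionBudget {
	maxUnavailable := intstr.FromInt(1)
	return &policyv1beta1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "bookie-pdb", // illustrative name
			Namespace: namespace,
		},
		Spec: policyv1beta1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector: &metav1.LabelSelector{
				// Illustrative selector; the operator's real pod labels
				// may differ.
				MatchLabels: map[string]string{"app": "bookkeeper"},
			},
		},
	}
}
```

Note that a PDB guards evictions (kubectl drain and the eviction API), not a direct kubectl delete pod, so the preStop hook above remains the safety net for manual deletions.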

@adrianmo added the kind/enhancement (Enhancement of an existing feature), priority/P1 (Recoverable error, functionality/performance impaired but not lost, no permanent damage) and status/needs-investigation (Further investigation is required) labels on Dec 20, 2018
@adrianmo added the priority/P2 (Slight inconvenience or annoyance to applications, system continues to function) label and removed the priority/P1 label on May 30, 2019
@adrianmo (Contributor, Author)

Changing priority after discussion with Prajakta and Srishti.

@pbelgundi pbelgundi self-assigned this Sep 18, 2019
@pbelgundi pbelgundi added the DR Disaster Recovery label Sep 18, 2019
@pbelgundi (Contributor) commented Sep 18, 2019

Data issues caused by a pod being unavailable for more than a certain period should be handled by AutoRecovery.
Currently, BookKeeper autorecovery is enabled by the operator, but its execution frequency, triggers, and efficiency are governed by configurable properties that need to be tuned to decide how long we want to tolerate potential data loss before kick-starting the recovery process.
Note that the recovery process has its own execution overhead, so we may want to strike a balance between data-loss tolerance and the performance overhead the recovery process adds.
This could also vary from use case to use case and hence should be configurable.

For details of the AutoRecovery process, see: https://bookkeeper.apache.org/docs/4.7.3/admin/autorecovery/#autorecovery

Some configuration properties that impact autorecovery (a tuning sketch follows the list):

  1. auditorPeriodicBookieCheckInterval - The time interval between auditor bookie checks, in seconds. The auditor bookie check inspects ledger metadata to see which bookies should contain entries for each ledger; if a bookie that should contain entries is unavailable, the ledgers containing those entries are marked for recovery. Defaults to once per day.
  2. auditorPeriodicCheckInterval - The time interval, in seconds, at which the auditor checks all ledgers in the cluster. Defaults to once per week.
  3. lostBookieRecoveryDelay - How long to wait, in seconds, before starting autorecovery of a lost bookie. Defaults to 0.
  4. underreplicatedLedgerRecoveryGracePeriod - The grace period (in seconds) for under-replicated ledger recovery. If a ledger is marked under-replicated for more than this period, it is reported by the placementPolicyCheck in the Auditor. Setting this to 0 disables the check. Defaults to 0.
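
As a sketch of how these might be tuned, here is an illustrative set of values expressed as the kind of key/value options an operator could pass through to the bookie configuration. The keys are real BookKeeper server properties; the values are examples only, not recommendations:

```go
package bookie

// autoRecoveryTuning returns illustrative values for the autorecovery
// properties listed above. Values are examples, not recommendations, and
// should be chosen per use case to balance data-loss tolerance against
// the overhead of re-replication.
func autoRecoveryTuning() map[string]string {
	return map[string]string{
		// Check ledger metadata for unavailable bookies hourly rather than daily.
		"auditorPeriodicBookieCheckInterval": "3600",
		// Full scan of all ledgers daily rather than weekly.
		"auditorPeriodicCheckInterval": "86400",
		// Wait five minutes before recovering a lost bookie, to ride out
		// short-lived pod restarts without triggering re-replication.
		"lostBookieRecoveryDelay": "300",
		// Flag ledgers that stay under-replicated for more than 30 minutes.
		"underreplicatedLedgerRecoveryGracePeriod": "1800",
	}
}
```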

@pbelgundi (Contributor)

Moved to Bookkeeper-Operator: pravega/bookkeeper-operator#39
