Overview
Unable to restore a backup made by PGO to S3 into another cluster. The restore job runs multiple times: only the first restore job completes successfully, every subsequent one fails, and the PostgresCluster's database pod never starts.
Environment
Please provide the following details:
Platform: Kubernetes
Platform Version: 1.29.9
PGO Image Tag: ubi8-14.8-0
Postgres Version: 14
Storage: no external volume mapped, using Kubernetes on DigitalOcean
Steps to Reproduce
REPRO
Provide steps to get to the error condition:
Install the default PGO on the cluster using the postgres-operator-examples repo.
Apply the following manifest (a sketch of the commands used is shown after the manifest):
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: psd-db
  namespace: psd
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:ubi8-14.8-0
  postgresVersion: 14
  # the data source is used to restore the database from a backup;
  # it should match the production backup bucket
  dataSource:
    pgbackrest:
      stanza: db # this is the default value; if overridden in the production cluster, it should be the same
      configuration:
        - secret:
            name: do-spaces-secret
      global:
        repo1-path: /pgbackrest/psd-db
        log-level-console: warn
        log-level-file: warn
        archive-async: n
      options:
        - --log-level-console=info
      repo:
        name: repo1
        s3:
          bucket: XXX
          endpoint: XXX
          region: XXX
  instances:
    - name: instance1
      replicas: 1
      dataVolumeClaimSpec:
        accessModes:
          - 'ReadWriteOnce'
        resources:
          requests:
            storage: 5Gi
  proxy:
    pgBouncer:
      replicas: 2
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          max_wal_size: 1GB
          archive_timeout: 60
          wal_level: replica
          max_replication_slots: 10
          max_wal_senders: 10
  users:
    - name: XXX
      databases:
        - XXX
        - XXX
        - XXX
        - XXX
        - XXX
        - XXX
        - XXX
        - XXX
        - XXX
        - XXX
      options: 'CREATEDB,CREATEROLE'
  databaseInitSQL:
    key: init-database.sql
    name: init-database
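For reference, PGO was installed and the cluster created roughly as follows. This is a sketch: the kustomize paths follow the PGO v5 quickstart for the postgres-operator-examples repo, and the local file names (do-spaces-secret.yaml, psd-db.yaml) and the configmap creation command are assumptions for illustration, not exact copies of what we ran.

  # Install PGO from the postgres-operator-examples repo (default kustomize install)
  git clone https://github.com/CrunchyData/postgres-operator-examples.git
  cd postgres-operator-examples
  kubectl apply -k kustomize/install/namespace
  kubectl apply --server-side -k kustomize/install/default

  # Create the target namespace, the pgBackRest S3 credentials secret (assumed file name),
  # the init SQL configmap referenced by databaseInitSQL, and the PostgresCluster itself
  kubectl create namespace psd
  kubectl -n psd apply -f do-spaces-secret.yaml
  kubectl -n psd create configmap init-database --from-file=init-database.sql
  kubectl -n psd apply -f psd-db.yaml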
EXPECTED
The restore job should run once and restore the backup into the cluster.
After the restore job completes, the PostgresCluster instance should start with the restored data.
ACTUAL
The restore job runs multiple times, with the following behavior (the commands used to observe this are sketched below, followed by the logs):
the first restore job completes with a success status;
after the first restore job completes, the instance (PostgresCluster) pod starts, but its database container keeps crashing;
the restore job is then created again, but this time it fails quickly, and it keeps being re-created, always ending in a failed status.
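For completeness, this is roughly how the repeated jobs and the crashing pod were observed; the namespace and cluster name come from the manifest above, while the exact job and pod names are placeholders:

  kubectl -n psd get jobs                          # one completed restore job, then repeatedly failing ones
  kubectl -n psd get pods                          # the instance pod (psd-db-instance1-...) stays in CrashLoopBackOff
  kubectl -n psd logs job/<restore-job-name>       # restore job output (see Logs below)
  kubectl -n psd logs <instance-pod> -c database   # Patroni / database container output (see Logs below)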
Logs
Here are the logs of the first restore job:
2025-01-10T08:44:53.007355047Z + pgbackrest restore --log-level-console=info --stanza=db --pg1-path=/pgdata/pg14 --repo=1 --delta --link-map=pg_wal=/pgdata/pg14_wal
2025-01-10T08:44:53.015093930Z WARN: unable to open log file '/pgdata/pgbackrest/log/db-restore.log': No such file or directory
2025-01-10T08:44:53.015124216Z NOTE: process will continue without log file.
2025-01-10T08:44:53.015297022Z 2025-01-10 08:44:53.015 P00 INFO: restore command begin 2.35: --delta --exec-id=9-3fc3cf46 --link-map=pg_wal=/pgdata/pg14_wal --log-level-console=info --log-level-file=off --log-path=/pgdata/pgbackrest/log --pg1-path=/pgdata/pg14 --repo=1 --repo1-path=/pgbackrest/psd-db --repo1-s3-bucket=dev-uploads --repo1-s3-endpoint=fra1.digitaloceanspaces.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=fra1 --repo1-type=s3 --spool-path=/pgdata/pgbackrest-spool --stanza=db
2025-01-10T08:44:53.015338290Z WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/pgdata/pg14' to confirm that this is a valid $PGDATA directory. --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
2025-01-10T08:44:53.165123437Z 2025-01-10 08:44:53.164 P00 INFO: repo1: restore backup set 20250108-113830F
2025-01-10T08:44:53.165162014Z 2025-01-10 08:44:53.164 P00 INFO: map link 'pg_wal' to '/pgdata/pg14_wal'
2025-01-10T08:45:43.268493956Z 2025-01-10 08:45:43.268 P00 INFO: write updated /pgdata/pg14/postgresql.auto.conf
2025-01-10T08:45:43.281644879Z 2025-01-10 08:45:43.281 P00 INFO: restore global/pg_control (performed last to ensure aborted restores cannot be started)
2025-01-10T08:45:43.286126026Z 2025-01-10 08:45:43.285 P00 INFO: restore size = 108.5MB, file total = 3909
2025-01-10T08:45:43.288405695Z 2025-01-10 08:45:43.288 P00 INFO: restore command end: completed successfully (50275ms)
2025-01-10T08:45:43.396367033Z 2025-01-10 08:45:43.396 GMT [18] LOG: starting PostgreSQL 14.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1), 64-bit
2025-01-10T08:45:43.397695537Z 2025-01-10 08:45:43.397 GMT [18] LOG: listening on IPv6 address "::1", port 5432
2025-01-10T08:45:43.397723650Z 2025-01-10 08:45:43.397 GMT [18] LOG: listening on IPv4 address "127.0.0.1", port 5432
2025-01-10T08:45:43.408259827Z 2025-01-10 08:45:43.408 GMT [18] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2025-01-10T08:45:43.418233171Z 2025-01-10 08:45:43.418 GMT [19] LOG: database system was interrupted; last known up at 2025-01-08 11:38:40 GMT
2025-01-10T08:45:43.650204584Z 2025-01-10 08:45:43.649 GMT [19] LOG: starting archive recovery
2025-01-10T08:45:43.776783777Z 2025-01-10 08:45:43.776 GMT [19] LOG: restored log file "000000010000000000000009" from archive
2025-01-10T08:45:43.847057630Z 2025-01-10 08:45:43.846 GMT [19] LOG: redo starts at 0/9000028
2025-01-10T08:45:43.974492277Z 2025-01-10 08:45:43.974 GMT [19] LOG: restored log file "00000001000000000000000A" from archive
2025-01-10T08:45:44.028786382Z 2025-01-10 08:45:44.028 GMT [19] LOG: consistent recovery state reached at 0/A000050
2025-01-10T08:45:44.029749340Z 2025-01-10 08:45:44.029 GMT [18] LOG: database system is ready to accept read-only connections
2025-01-10T08:45:44.174042779Z 2025-01-10 08:45:44.173 GMT [19] LOG: restored log file "00000001000000000000000B" from archive
2025-01-10T08:45:44.282325706Z 2025-01-10 08:45:44.282 GMT [19] LOG: redo done at 0/B000060 system usage: CPU: user: 0.00 s, system: 0.01 s, elapsed: 0.43 s
2025-01-10T08:45:44.421559387Z 2025-01-10 08:45:44.421 GMT [19] LOG: restored log file "00000001000000000000000B" from archive
2025-01-10T08:45:44.523718135Z 2025-01-10 08:45:44.523 GMT [19] LOG: selected new timeline ID: 2
2025-01-10T08:45:44.599470876Z 2025-01-10 08:45:44.599 GMT [19] LOG: archive recovery complete
2025-01-10T08:45:44.696744674Z 2025-01-10 08:45:44.696 GMT [18] LOG: database system is ready to accept connections
2025-01-10T08:45:44.700933278Z 2025-01-10 08:45:44.700 GMT [37] LOG: archive command failed with exit code 1
2025-01-10T08:45:44.700965957Z 2025-01-10 08:45:44.700 GMT [37] DETAIL: The failed archive command was: false
2025-01-10T08:45:45.706563634Z 2025-01-10 08:45:45.706 GMT [37] LOG: archive command failed with exit code 1
2025-01-10T08:45:45.706600221Z 2025-01-10 08:45:45.706 GMT [37] DETAIL: The failed archive command was: false
2025-01-10T08:45:46.119127842Z 2025-01-10 08:45:46.118 GMT [18] LOG: received fast shutdown request
2025-01-10T08:45:46.124803040Z 2025-01-10 08:45:46.124 GMT [18] LOG: aborting any active transactions
2025-01-10T08:45:46.127175148Z 2025-01-10 08:45:46.126 GMT [18] LOG: background worker "logical replication launcher" (PID 38) exited with exit code 1
2025-01-10T08:45:46.127264734Z 2025-01-10 08:45:46.127 GMT [22] LOG: shutting down
2025-01-10T08:45:46.208225879Z 2025-01-10 08:45:46.208 GMT [37] LOG: archive command failed with exit code 1
2025-01-10T08:45:46.208258982Z 2025-01-10 08:45:46.208 GMT [37] DETAIL: The failed archive command was: false
2025-01-10T08:45:46.208265020Z 2025-01-10 08:45:46.208 GMT [37] WARNING: archiving write-ahead log file "00000002.history" failed too many times, will try again later
2025-01-10T08:45:46.213014042Z 2025-01-10 08:45:46.212 GMT [37] LOG: archive command failed with exit code 1
2025-01-10T08:45:46.213396964Z 2025-01-10 08:45:46.212 GMT [37] DETAIL: The failed archive command was: false
2025-01-10T08:45:47.218680591Z 2025-01-10 08:45:47.218 GMT [37] LOG: archive command failed with exit code 1
2025-01-10T08:45:47.218712495Z 2025-01-10 08:45:47.218 GMT [37] DETAIL: The failed archive command was: false
2025-01-10T08:45:48.224892246Z 2025-01-10 08:45:48.224 GMT [37] LOG: archive command failed with exit code 1
2025-01-10T08:45:48.224922097Z 2025-01-10 08:45:48.224 GMT [37] DETAIL: The failed archive command was: false
2025-01-10T08:45:48.224926823Z 2025-01-10 08:45:48.224 GMT [37] WARNING: archiving write-ahead log file "00000002.history" failed too many times, will try again later
2025-01-10T08:45:48.229058316Z 2025-01-10 08:45:48.228 GMT [18] LOG: database system is shut down
Here are the logs of the database container in the instance pod (it started after the first restore job completed):
2025-01-10 11:11:20,619 INFO: No PostgreSQL configuration items changed, nothing to reload.
2025-01-10 11:11:20,628 INFO: Lock owner: None; I am psd-db-instance1-6879-0
2025-01-10 11:11:20,800 INFO: trying to bootstrap a new cluster
2025-01-10 11:11:20,800 INFO: Running custom bootstrap script: mv "/pgdata/pg14_bootstrap" "/pgdata/pg14"
mv: cannot stat '/pgdata/pg14_bootstrap': No such file or directory
2025-01-10 11:11:20,813 INFO: removing initialize key after failed attempt to bootstrap the cluster
Traceback (most recent call last):
File "/usr/local/bin/patroni", line 11, in<module>sys.exit(main())
File "/usr/local/lib/python3.6/site-packages/patroni/__main__.py", line 144, in main
return patroni_main()
File "/usr/local/lib/python3.6/site-packages/patroni/__main__.py", line 136, in patroni_main
abstract_main(Patroni, schema)
File "/usr/local/lib/python3.6/site-packages/patroni/daemon.py", line 108, in abstract_main
controller.run()
File "/usr/local/lib/python3.6/site-packages/patroni/__main__.py", line 106, in run
super(Patroni, self).run()
File "/usr/local/lib/python3.6/site-packages/patroni/daemon.py", line 65, in run
self._run_cycle()
File "/usr/local/lib/python3.6/site-packages/patroni/__main__.py", line 109, in _run_cycle
logger.info(self.ha.run_cycle())
File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1558, in run_cycle
info = self._run_cycle()
File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1416, in _run_cycle
return self.post_bootstrap()
File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1308, in post_bootstrap
self.cancel_initialization()
File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1301, in cancel_initialization
raise PatroniFatalException('Failed to bootstrap cluster')
Here are the logs of the repeated restore jobs that ran after the first restore job and the creation of the instance pod:
+ pgbackrest restore --log-level-console=info --stanza=db --pg1-path=/pgdata/pg14 --repo=1 --delta --link-map=pg_wal=/pgdata/pg14_wal
2025-01-10 08:47:17.112 P00 INFO: restore command begin 2.35: --delta --exec-id=9-9852e2d4 --link-map=pg_wal=/pgdata/pg14_wal --log-level-console=info --log-level-file=warn --log-path=/pgdata/pgbackrest/log --pg1-path=/pgdata/pg14 --repo=1 --repo1-path=/pgbackrest/psd-db --repo1-s3-bucket=dev-uploads --repo1-s3-endpoint=fra1.digitaloceanspaces.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=fra1 --repo1-type=s3 --spool-path=/pgdata/pgbackrest-spool --stanza=db
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/pgdata/pg14' to confirm that this is a valid $PGDATA directory. --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
2025-01-10 08:47:17.243 P00 INFO: repo1: restore backup set 20250108-113830F
2025-01-10 08:47:17.243 P00 INFO: map link 'pg_wal' to '/pgdata/pg14_wal'
ERROR: [040]: unable to restore to path '/pgdata/pg14_wal' because it contains files
HINT: try using --delta if this is what you intended.
2025-01-10 08:47:17.244 P00 INFO: restore command end: aborted with exception [040]
Additional Information
All the secrets are injected properly and correctly; a sketch of the expected secret layout is included below.
We have multiple databases in one cluster, all managed by a single DB user.
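For reference, the do-spaces-secret follows the layout described in the PGO docs for S3-compatible repositories. The manifest below is what we believe the secret must look like (values redacted); it is an illustration, not a copy of the actual secret. The option names repo1-s3-key and repo1-s3-key-secret match the ones visible in the restore command line in the logs above.

  apiVersion: v1
  kind: Secret
  metadata:
    name: do-spaces-secret
    namespace: psd
  type: Opaque
  stringData:
    s3.conf: |
      [global]
      repo1-s3-key=<spaces-access-key>
      repo1-s3-key-secret=<spaces-secret-key>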