unable to restore from S3 for postgres 14 #4068

Open
dakshlohar1 opened this issue Jan 10, 2025 · 1 comment

dakshlohar1 commented Jan 10, 2025

Overview

Unable to restore a backup taken by PGO to S3 into another cluster. The restore job runs multiple times: only the first restore job completes successfully, the subsequent ones fail, and the PostgresCluster's database pod does not start.

Environment

Please provide the following details:

  • Platform: Kubernetes
  • Platform Version: 1.29.9
  • PGO Image Tag: ubi8-14.8-0
  • Postgres Version: 14
  • Storage: no external volume mapped, using Kubernetes on DigitalOcean

Steps to Reproduce

REPRO

Provide steps to get to the error condition:

  1. Install the default PGO on the cluster using the example repo.
  2. Apply the following manifest:
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: psd-db
  namespace: psd
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:ubi8-14.8-0
  postgresVersion: 14
  # the data source is used to restore the database from a backup
  # it should match the production backup bucket
  dataSource:
    pgbackrest:
      stanza: db # this is default value, if overridden in the production cluster, it should be the same
      configuration:
        - secret:
            name: do-spaces-secret
      global:
        repo1-path: /pgbackrest/psd-db
        log-level-console: warn
        log-level-file: warn
        archive-async: n
      options:
        - --log-level-console=info
      repo:
        name: repo1
        s3:
          bucket: XXX
          endpoint: XXX
          region: XXX
  instances:
    - name: instance1
      replicas: 1
      dataVolumeClaimSpec:
        accessModes:
          - 'ReadWriteOnce'
        resources:
          requests:
            storage: 5Gi
  proxy:
    pgBouncer:
      replicas: 2
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          max_wal_size: 1GB
          archive_timeout: 60
          wal_level: replica
          max_replication_slots: 10
          max_wal_senders: 10
  users:
    - name: XXX
      databases:
        - XXX
        - XXX
        - XXX
        - XXX
        - XXX
        - XXX
        - XXX
        - XXX
        - XXX
        - XXX
      options: 'CREATEDB,CREATEROLE'
  databaseInitSQL:
    key: init-database.sql
    name: init-database

EXPECTED

  1. The restore job should run once and restore the backup into the cluster.
  2. After the restore job completes, the PostgresCluster instance should start with the restored backup.

ACTUAL

The restore job runs multiple times with the following behavior:

  • One (the first) restore job completes with a success status.
  • After the first restore job completes, the instance (PostgresCluster) pod starts, but its database sidecar container crashes.
  • The restore job is then created again, and this time it fails quickly (it keeps being recreated, each time with a failed status).

Logs

Here are the logs of the first restore job:

2025-01-10T08:44:53.007355047Z + pgbackrest restore --log-level-console=info --stanza=db --pg1-path=/pgdata/pg14 --repo=1 --delta --link-map=pg_wal=/pgdata/pg14_wal
2025-01-10T08:44:53.015093930Z WARN: unable to open log file '/pgdata/pgbackrest/log/db-restore.log': No such file or directory
2025-01-10T08:44:53.015124216Z       NOTE: process will continue without log file.
2025-01-10T08:44:53.015297022Z 2025-01-10 08:44:53.015 P00   INFO: restore command begin 2.35: --delta --exec-id=9-3fc3cf46 --link-map=pg_wal=/pgdata/pg14_wal --log-level-console=info --log-level-file=off --log-path=/pgdata/pgbackrest/log --pg1-path=/pgdata/pg14 --repo=1 --repo1-path=/pgbackrest/psd-db --repo1-s3-bucket=dev-uploads --repo1-s3-endpoint=fra1.digitaloceanspaces.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=fra1 --repo1-type=s3 --spool-path=/pgdata/pgbackrest-spool --stanza=db
2025-01-10T08:44:53.015338290Z WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/pgdata/pg14' to confirm that this is a valid $PGDATA directory.  --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
2025-01-10T08:44:53.165123437Z 2025-01-10 08:44:53.164 P00   INFO: repo1: restore backup set 20250108-113830F
2025-01-10T08:44:53.165162014Z 2025-01-10 08:44:53.164 P00   INFO: map link 'pg_wal' to '/pgdata/pg14_wal'
2025-01-10T08:45:43.268493956Z 2025-01-10 08:45:43.268 P00   INFO: write updated /pgdata/pg14/postgresql.auto.conf
2025-01-10T08:45:43.281644879Z 2025-01-10 08:45:43.281 P00   INFO: restore global/pg_control (performed last to ensure aborted restores cannot be started)
2025-01-10T08:45:43.286126026Z 2025-01-10 08:45:43.285 P00   INFO: restore size = 108.5MB, file total = 3909
2025-01-10T08:45:43.288405695Z 2025-01-10 08:45:43.288 P00   INFO: restore command end: completed successfully (50275ms)
2025-01-10T08:45:43.396367033Z 2025-01-10 08:45:43.396 GMT [18] LOG:  starting PostgreSQL 14.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1), 64-bit
2025-01-10T08:45:43.397695537Z 2025-01-10 08:45:43.397 GMT [18] LOG:  listening on IPv6 address "::1", port 5432
2025-01-10T08:45:43.397723650Z 2025-01-10 08:45:43.397 GMT [18] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2025-01-10T08:45:43.408259827Z 2025-01-10 08:45:43.408 GMT [18] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2025-01-10T08:45:43.418233171Z 2025-01-10 08:45:43.418 GMT [19] LOG:  database system was interrupted; last known up at 2025-01-08 11:38:40 GMT
2025-01-10T08:45:43.650204584Z 2025-01-10 08:45:43.649 GMT [19] LOG:  starting archive recovery
2025-01-10T08:45:43.776783777Z 2025-01-10 08:45:43.776 GMT [19] LOG:  restored log file "000000010000000000000009" from archive
2025-01-10T08:45:43.847057630Z 2025-01-10 08:45:43.846 GMT [19] LOG:  redo starts at 0/9000028
2025-01-10T08:45:43.974492277Z 2025-01-10 08:45:43.974 GMT [19] LOG:  restored log file "00000001000000000000000A" from archive
2025-01-10T08:45:44.028786382Z 2025-01-10 08:45:44.028 GMT [19] LOG:  consistent recovery state reached at 0/A000050
2025-01-10T08:45:44.029749340Z 2025-01-10 08:45:44.029 GMT [18] LOG:  database system is ready to accept read-only connections
2025-01-10T08:45:44.174042779Z 2025-01-10 08:45:44.173 GMT [19] LOG:  restored log file "00000001000000000000000B" from archive
2025-01-10T08:45:44.282325706Z 2025-01-10 08:45:44.282 GMT [19] LOG:  redo done at 0/B000060 system usage: CPU: user: 0.00 s, system: 0.01 s, elapsed: 0.43 s
2025-01-10T08:45:44.421559387Z 2025-01-10 08:45:44.421 GMT [19] LOG:  restored log file "00000001000000000000000B" from archive
2025-01-10T08:45:44.523718135Z 2025-01-10 08:45:44.523 GMT [19] LOG:  selected new timeline ID: 2
2025-01-10T08:45:44.599470876Z 2025-01-10 08:45:44.599 GMT [19] LOG:  archive recovery complete
2025-01-10T08:45:44.696744674Z 2025-01-10 08:45:44.696 GMT [18] LOG:  database system is ready to accept connections
2025-01-10T08:45:44.700933278Z 2025-01-10 08:45:44.700 GMT [37] LOG:  archive command failed with exit code 1
2025-01-10T08:45:44.700965957Z 2025-01-10 08:45:44.700 GMT [37] DETAIL:  The failed archive command was: false
2025-01-10T08:45:45.706563634Z 2025-01-10 08:45:45.706 GMT [37] LOG:  archive command failed with exit code 1
2025-01-10T08:45:45.706600221Z 2025-01-10 08:45:45.706 GMT [37] DETAIL:  The failed archive command was: false
2025-01-10T08:45:46.119127842Z 2025-01-10 08:45:46.118 GMT [18] LOG:  received fast shutdown request
2025-01-10T08:45:46.124803040Z 2025-01-10 08:45:46.124 GMT [18] LOG:  aborting any active transactions
2025-01-10T08:45:46.127175148Z 2025-01-10 08:45:46.126 GMT [18] LOG:  background worker "logical replication launcher" (PID 38) exited with exit code 1
2025-01-10T08:45:46.127264734Z 2025-01-10 08:45:46.127 GMT [22] LOG:  shutting down
2025-01-10T08:45:46.208225879Z 2025-01-10 08:45:46.208 GMT [37] LOG:  archive command failed with exit code 1
2025-01-10T08:45:46.208258982Z 2025-01-10 08:45:46.208 GMT [37] DETAIL:  The failed archive command was: false
2025-01-10T08:45:46.208265020Z 2025-01-10 08:45:46.208 GMT [37] WARNING:  archiving write-ahead log file "00000002.history" failed too many times, will try again later
2025-01-10T08:45:46.213014042Z 2025-01-10 08:45:46.212 GMT [37] LOG:  archive command failed with exit code 1
2025-01-10T08:45:46.213396964Z 2025-01-10 08:45:46.212 GMT [37] DETAIL:  The failed archive command was: false
2025-01-10T08:45:47.218680591Z 2025-01-10 08:45:47.218 GMT [37] LOG:  archive command failed with exit code 1
2025-01-10T08:45:47.218712495Z 2025-01-10 08:45:47.218 GMT [37] DETAIL:  The failed archive command was: false
2025-01-10T08:45:48.224892246Z 2025-01-10 08:45:48.224 GMT [37] LOG:  archive command failed with exit code 1
2025-01-10T08:45:48.224922097Z 2025-01-10 08:45:48.224 GMT [37] DETAIL:  The failed archive command was: false
2025-01-10T08:45:48.224926823Z 2025-01-10 08:45:48.224 GMT [37] WARNING:  archiving write-ahead log file "00000002.history" failed too many times, will try again later
2025-01-10T08:45:48.229058316Z 2025-01-10 08:45:48.228 GMT [18] LOG:  database system is shut down

Here are the database sidecar logs from the instance pod that started after the first job completed:

2025-01-10 11:11:20,619 INFO: No PostgreSQL configuration items changed, nothing to reload.
2025-01-10 11:11:20,628 INFO: Lock owner: None; I am psd-db-instance1-6879-0
2025-01-10 11:11:20,800 INFO: trying to bootstrap a new cluster
2025-01-10 11:11:20,800 INFO: Running custom bootstrap script: mv "/pgdata/pg14_bootstrap" "/pgdata/pg14"
mv: cannot stat '/pgdata/pg14_bootstrap': No such file or directory
2025-01-10 11:11:20,813 INFO: removing initialize key after failed attempt to bootstrap the cluster
Traceback (most recent call last):
  File "/usr/local/bin/patroni", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/patroni/__main__.py", line 144, in main
    return patroni_main()
  File "/usr/local/lib/python3.6/site-packages/patroni/__main__.py", line 136, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/local/lib/python3.6/site-packages/patroni/daemon.py", line 108, in abstract_main
    controller.run()
  File "/usr/local/lib/python3.6/site-packages/patroni/__main__.py", line 106, in run
    super(Patroni, self).run()
  File "/usr/local/lib/python3.6/site-packages/patroni/daemon.py", line 65, in run
    self._run_cycle()
  File "/usr/local/lib/python3.6/site-packages/patroni/__main__.py", line 109, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1558, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1416, in _run_cycle
    return self.post_bootstrap()
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1308, in post_bootstrap
    self.cancel_initialization()
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1301, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')

Here are the logs of the repeated restore jobs that ran after the first restore job and the creation of the instance pod:

+ pgbackrest restore --log-level-console=info --stanza=db --pg1-path=/pgdata/pg14 --repo=1 --delta --link-map=pg_wal=/pgdata/pg14_wal
2025-01-10 08:47:17.112 P00   INFO: restore command begin 2.35: --delta --exec-id=9-9852e2d4 --link-map=pg_wal=/pgdata/pg14_wal --log-level-console=info --log-level-file=warn --log-path=/pgdata/pgbackrest/log --pg1-path=/pgdata/pg14 --repo=1 --repo1-path=/pgbackrest/psd-db --repo1-s3-bucket=dev-uploads --repo1-s3-endpoint=fra1.digitaloceanspaces.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=fra1 --repo1-type=s3 --spool-path=/pgdata/pgbackrest-spool --stanza=db
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/pgdata/pg14' to confirm that this is a valid $PGDATA directory.  --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
2025-01-10 08:47:17.243 P00   INFO: repo1: restore backup set 20250108-113830F
2025-01-10 08:47:17.243 P00   INFO: map link 'pg_wal' to '/pgdata/pg14_wal'
ERROR: [040]: unable to restore to path '/pgdata/pg14_wal' because it contains files
       HINT: try using --delta if this is what you intended.
2025-01-10 08:47:17.244 P00   INFO: restore command end: aborted with exception [040]

Additional Information

  • All secrets are injected correctly.
  • We have multiple databases on one cluster, managed by a single DB user.

dajeffers (Contributor) commented:

From what I can tell, this tends to happen when using a dataSource with backups disabled. Enabling backups allows the database to start.
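
For anyone hitting the same symptom, here is a minimal sketch of what adding a backups section to the manifest above might look like. The secret name and S3 placeholders are copied from the dataSource in the issue; the repo1-path value is a hypothetical choice (it should normally differ from the source cluster's path so the restored cluster does not overwrite the original backups). Verify the exact fields against the PGO documentation for your operator version.

spec:
  backups:
    pgbackrest:
      configuration:
        - secret:
            name: do-spaces-secret   # same S3 credentials secret used by the dataSource above
      global:
        # hypothetical path for the restored cluster, distinct from the source cluster's repo1-path
        repo1-path: /pgbackrest/psd-db-restored
      repos:
        - name: repo1
          s3:
            bucket: XXX
            endpoint: XXX
            region: XXX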
