Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for pg_stat_replication.(write_lag|flush_lag|replay_lag) #1007

Open
spootze opened this issue Feb 26, 2024 · 3 comments
Open

Add support for pg_stat_replication.(write_lag|flush_lag|replay_lag) #1007

spootze opened this issue Feb 26, 2024 · 3 comments

Comments

@spootze
Copy link

spootze commented Feb 26, 2024

Add support for pg_stat_replication.(write_lag|flush_lag|replay_lag)

Use case. Why is this important?

For simple primary:replica setups it would be convenient to be able to monitor standby lag in seconds from the primary. pg_stat_replication already includes this information on postgres>=10, but the exporter does not parse it. AFAIK, workarounds would include

  • monitoring byte offset from pg_stat_replication.write_lsn|flush_lsn|replay_lsn), but this does not quite capture issues where the replica has not replayed a business critical change however small in bytes.
  • monitoring delay on the replica, but this would require spinning up a separate monitor only to monitor a single value on the replica.

Notes

I'm assuming the exporter does not support parsing the interval data type, which is why these metrics are marked as DISCARD here. I wonder if the following, explicit approach would be appropriate in this case:

diff --git a/cmd/postgres_exporter/queries.go b/cmd/postgres_exporter/queries.go
index fa0b5c2..e28d7b4 100644
--- a/cmd/postgres_exporter/queries.go
+++ b/cmd/postgres_exporter/queries.go
@@ -53,7 +53,10 @@ var queryOverrides = map[string][]OverrideQuery{
 			SELECT *,
 				(case pg_is_in_recovery() when 't' then null else pg_current_wal_lsn() end) AS pg_current_wal_lsn,
 				(case pg_is_in_recovery() when 't' then null else pg_wal_lsn_diff(pg_current_wal_lsn(), pg_lsn('0/0'))::float end) AS pg_current_wal_lsn_bytes,
-				(case pg_is_in_recovery() when 't' then null else pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)::float end) AS pg_wal_lsn_diff
+				(case pg_is_in_recovery() when 't' then null else pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)::float end) AS pg_wal_lsn_diff,
+				(case pg_is_in_recovery() when 't' then null else  extract(epoch from write_lag) end) as write_lag_seconds,
+				(case pg_is_in_recovery() when 't' then null else  extract(epoch from flush_lag) end) as flush_lag_seconds,
+				(case pg_is_in_recovery() when 't' then null else  extract(epoch from replay_lag) end) as replay_lag_seconds
 			FROM pg_stat_replication
 			`,
 		},

That way the deviation from the official fields is not in conflict with the source data.

@CarpathianUA
Copy link

Also interested in this, it would be great to have a replay_lag

@longtomjr
Copy link

Just a note for anyone looking into this. epoch should be extracted from the interval instead of seconds, since extracting seconds will only extract the seconds "component" of the interval:

				(case pg_is_in_recovery() when 't' then null else  extract(epoch from write_lag) end) as write_lag_seconds,
				(case pg_is_in_recovery() when 't' then null else  extract(epoch from flush_lag) end) as flush_lag_seconds,
				(case pg_is_in_recovery() when 't' then null else  extract(epoch from replay_lag) end) as replay_lag_seconds

@spootze
Copy link
Author

spootze commented May 28, 2024

Just a note for anyone looking into this. epoch should be extracted from the interval instead of seconds, since extracting seconds will only extract the seconds "component" of the interval:

Good shout - fixed in original suggestion now

bitfehler added a commit to bitfehler/postgres_exporter that referenced this issue Nov 6, 2024
The exported replication lag does not handle all failure modes, and can
report 0 for replicas that are out of sync and incapable of recovery.

A proper replacement for that metric would require a different approach
(see e.g. prometheus-community#1007), but for a lot of folks, simply exporting the age of
the last replay can provide a pretty strong signal for something being
amiss.

I think this solution might be preferrable to prometheus-community#977, though the lag
metric needs to be fixed or abandoned eventually.

Signed-off-by: Conrad Hoffmann <[email protected]>
bitfehler added a commit to bitfehler/postgres_exporter that referenced this issue Nov 6, 2024
The exported replication lag does not handle all failure modes, and can
report 0 for replicas that are out of sync and incapable of recovery.

A proper replacement for that metric would require a different approach
(see e.g. prometheus-community#1007), but for a lot of folks, simply exporting the age of
the last replay can provide a pretty strong signal for something being
amiss.

I think this solution might be preferable to prometheus-community#977, though the lag
metric needs to be fixed or abandoned eventually.

Signed-off-by: Conrad Hoffmann <[email protected]>
bitfehler added a commit to bitfehler/postgres_exporter that referenced this issue Nov 6, 2024
The exported replication lag does not handle all failure modes, and can
report 0 for replicas that are out of sync and incapable of recovery.

A proper replacement for that metric would require a different approach
(see e.g. prometheus-community#1007), but for a lot of folks, simply exporting the age of
the last replay can provide a pretty strong signal for something being
amiss.

I think this solution might be preferable to prometheus-community#977, though the lag
metric needs to be fixed or abandoned eventually.

Signed-off-by: Conrad Hoffmann <[email protected]>
bitfehler added a commit to bitfehler/postgres_exporter that referenced this issue Nov 12, 2024
The exported replication lag does not handle all failure modes, and can
report 0 for replicas that are out of sync and incapable of recovery.

A proper replacement for that metric would require a different approach
(see e.g. prometheus-community#1007), but for a lot of folks, simply exporting the age of
the last replay can provide a pretty strong signal for something being
amiss.

I think this solution might be preferable to prometheus-community#977, though the lag
metric needs to be fixed or abandoned eventually.

Signed-off-by: Conrad Hoffmann <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants