Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(Monitoring): Adding Monitoring for Disk Space and Number of Backups #7404

Merged
merged 6 commits into from
Feb 9, 2021

Conversation

gja
Copy link
Contributor

@gja gja commented Feb 7, 2021


This change is Reviewable

Copy link
Contributor

@danielmai danielmai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r1.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @gja, @manishrjain, and @vvbalaji-dgraph)


x/disk_metrics_others.go, line 12 at r1 (raw file):

"File System Metrics are not currently supported on non-linux platforms"

File system metrics are not currently supported on non-Linux platforms

Suggestion: change the capitalization.


x/metrics.go, line 66 at r1 (raw file):

	// NumBackupsSuccess is the number of backups successfully completed
	NumBackupsSuccess = stats.Int64("num_backups_success",
		"Total number of backups completed", stats.UnitDimensionless)

We should add a metric for backup failures too so we can set up alerts for when backups fail. It can happen for whatever reason (context timeouts, EOF errors). Backup failures can't be caught on the client-side if the backup request takes too long, causing the client request to timeout while the backup is still pending on the Dgraph side.


x/metrics.go, line 440 at r1 (raw file):

	// Exposing metrics at /metrics, which is the usual standard, as well as at the old endpoint
	http.Handle("/metrics", pe)
	http.Handle("/debug/prometheus_metrics", pe)

Should /debug/prometheus_metrics be a 301 redirect to /metrics? Would a redirect still work for existing Prometheus configs that refer to /debug/prometheus_metrics?

Copy link
Contributor

@danielmai danielmai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @gja, @manishrjain, and @vvbalaji-dgraph)


x/metrics.go, line 66 at r1 (raw file):

		"Total number of backups requested", stats.UnitDimensionless)
	// NumBackupsSuccess is the number of backups successfully completed
	NumBackupsSuccess = stats.Int64("num_backups_success",

The suffix for this metric should be _total to follow Prometheus's naming convention.

Copy link
Contributor

@danielmai danielmai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @gja, @manishrjain, and @vvbalaji-dgraph)


dgraph/cmd/zero/run.go, line 346 at r1 (raw file):

zero_wal_dir

We can call this wal. It's already known to be Zero's wal directory since this metric is coming from Zero.


worker/server_state.go, line 136 at r1 (raw file):

	// Commenting this out because Badger is doing its own cache checks.
	go x.MonitorCacheHealth(s.Pstore, s.gcCloser)
	go x.MonitorDiskMetrics("posting_dir", Config.PostingDir, s.gcCloser)

posting_dir can be renamed to postings. This follows the long-form flag name for setting the p directory ("postings").


x/metrics.go, line 96 at r1 (raw file):

disk_free

This should be called disk_free_bytes


x/metrics.go, line 99 at r1 (raw file):

disk_used

This should be called disk_used_bytes


x/metrics.go, line 102 at r1 (raw file):

disk_total

This should be called disk_bytes_total


x/metrics.go, line 163 at r1 (raw file):

dir_type

This can be dir. This is the same naming scheme used in Badger metrics. e.g., https://github.com/dgraph-io/dgraph/blob/d79d3797bf90d9bb5a57b434a741302962ea5d84/x/metrics.go#L424-L428

Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Defer to @danielmai for final approval.

Reviewed 6 of 6 files at r2.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @danielmai, @gja, and @vvbalaji-dgraph)


worker/server_state.go, line 136 at r1 (raw file):

Previously, danielmai (Daniel Mai) wrote…

posting_dir can be renamed to postings. This follows the long-form flag name for setting the p directory ("postings").

postings-fs so it's clear it's at a filesystem level.

@gja gja merged commit 07046c6 into master Feb 9, 2021
@gja gja deleted the tejas/moar-metrics-much-amaze branch February 9, 2021 16:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants