Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New metric: CPU usage #5545

Merged
merged 2 commits into from
Jan 5, 2018
Merged

Conversation

kvch
Copy link
Contributor

@kvch kvch commented Nov 8, 2017

  • Add both percentage and normalized percentage of CPU.
  • Move Round function to common.


dTime := sample.time.Sub(lastSample.time)
dMilli := dTime / time.Millisecond
dCpu := int64(sample.procTimes.Total - lastSample.procTimes.Total)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var dCpu should be dCPU

monitoring.ReportFloat(V, "usage.normalized", normalizedCpu)
}

func getCpuUsage() (float64, float64) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

func getCpuUsage should be getCPUUsage

V.OnRegistryStart()
defer V.OnRegistryFinished()

cpuUsage, normalizedCpu := getCpuUsage()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var normalizedCpu should be normalizedCPU

@@ -36,3 +56,38 @@ func reportInfo(_ monitoring.Mode, V monitoring.Visitor) {
uptime := int64(delta.Seconds())*1000 + int64(delta.Nanoseconds()/1000000)
monitoring.ReportInt(V, "uptime.ms", uptime)
}

func reportCpu(_ monitoring.Mode, V monitoring.Visitor) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

func reportCpu should be reportCPU


dTime := sample.time.Sub(lastSample.time)
dMilli := dTime / time.Millisecond
dCpu := int64(sample.procTimes.Total - lastSample.procTimes.Total)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var dCpu should be dCPU

monitoring.ReportFloat(V, "usage.normalized", normalizedCpu)
}

func getCpuUsage() (float64, float64) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

func getCpuUsage should be getCPUUsage

V.OnRegistryStart()
defer V.OnRegistryFinished()

cpuUsage, normalizedCpu := getCpuUsage()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var normalizedCpu should be normalizedCPU

@@ -36,3 +56,38 @@ func reportInfo(_ monitoring.Mode, V monitoring.Visitor) {
uptime := int64(delta.Seconds())*1000 + int64(delta.Nanoseconds()/1000000)
monitoring.ReportInt(V, "uptime.ms", uptime)
}

func reportCpu(_ monitoring.Mode, V monitoring.Visitor) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

func reportCpu should be reportCPU


dTime := sample.time.Sub(lastSample.time)
dMilli := dTime / time.Millisecond
dCpu := int64(sample.procTimes.Total - lastSample.procTimes.Total)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var dCpu should be dCPU

monitoring.ReportFloat(V, "usage.normalized", normalizedCpu)
}

func getCpuUsage() (float64, float64) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

func getCpuUsage should be getCPUUsage

V.OnRegistryStart()
defer V.OnRegistryFinished()

cpuUsage, normalizedCpu := getCpuUsage()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var normalizedCpu should be normalizedCPU

@@ -36,3 +56,38 @@ func reportInfo(_ monitoring.Mode, V monitoring.Visitor) {
uptime := int64(delta.Seconds())*1000 + int64(delta.Nanoseconds()/1000000)
monitoring.ReportInt(V, "uptime.ms", uptime)
}

func reportCpu(_ monitoring.Mode, V monitoring.Visitor) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

func reportCpu should be reportCPU

Copy link
Contributor

@ph ph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes concerning the double struct and the value of the cpu usage.

}{
time.Now(),
sigar.ProcTime{},
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are using the same anonymized struct in two different places in the code We can make a concrete struct maybe called CpuSample and reuse it or we could also encapsulate the creation of the sample into his own method and return the sample or error.

dCPU := int64(sample.procTimes.Total - lastSample.procTimes.Total)

usage := float64(dCPU) / float64(dMilli)
normalized := usage / float64(numCores)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I might be missing some context here concerning the nature of the recording.
This will give us a rate of cpu usage?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it gives the total CPU usage of a beat between the last two samples.

I used the method from metricbeat/modules/system/process/helper. I thought about extracting the whole Process and ProcStats and moving it libbeat. However, it seemed unnecessary to move such a big part of metricbeat into libbeat for only this minor calculation. But I am open to reconsider moving it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All fine then! thanks @kvch !

}
newVal = round / pow
return
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐼 I wish it was in the standard library, I think we should make this method takes an argument for the precision and create a method in the metrics that use the maxDecimalPlaces, since it's in the common file I don't think we should make arbitrary decision on the precision.

But It can be up to discussion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented it

"runtime"

sigar "github.com/elastic/gosigar"
"github.com/elastic/libbeat/common"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this should be github.com/elastic/beats/libbeat/common

OneMinute: Round(m.sample.One),
FiveMinute: Round(m.sample.Five),
FifteenMinute: Round(m.sample.Fifteen),
OneMinute: common.Round(m.sample.One),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These need updating to the new Round() signature.

@tsg
Copy link
Contributor

tsg commented Nov 9, 2017

The new metrics (beat.cpu.usage, beat.cpu.usage.normalized) should be added to the list here.

Btw, I think needs to be done also for the beat.info.uptime.ms metric from the other PR.

@tsg
Copy link
Contributor

tsg commented Nov 9, 2017

@ruflin I wonder if we should replace beat with agent as a namespace for these before they start being used. I'm thinking in the context of the common schema, ofc.

@kvch
Copy link
Contributor Author

kvch commented Nov 9, 2017

I added beat.cpu.* to gauges list. I also added beat.info.uptime.ms to the list in a separate commit. Should I cherry-pick it to a new branch and open a new PR?

Copy link
Contributor

@ph ph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM, I think we just need the input of @ruflin concerning the common schema.

@kvch kvch force-pushed the feature/libbeat/cpu-usage-metric branch from 9aece29 to 159661e Compare November 9, 2017 15:39
@kvch kvch added the in progress Pull request is currently in progress. label Nov 9, 2017
@kvch kvch force-pushed the feature/libbeat/cpu-usage-metric branch 2 times, most recently from 98f37b8 to 52b5634 Compare November 9, 2017 16:21

dTime := sample.time.Sub(lastSample.time)
dMilli := dTime / time.Millisecond
dCPU := int64(sample.procTimes.Total - lastSample.procTimes.Total)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we would only report sample.procTimes.Total with a timestamp + # of cores, we could leave the calculation of 1, 5 and 15 minutes average to Kibana? Like this it would be up to KB to decide which averages it wants to calculate. Perhaps I miss something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your idea. @tsullivan WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I'm following... is this for system load average?

cc @pickypg

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now it returns the number of CPU usage between the last two sampling times. Let's say that the sampling time is 1 minute. It would return the following metrics in case of constant usage:

beat.cpu.usage=0.1   // at 12:00
beat.cpu.usage=0.1   // at 12:01
beat.cpu.usage=0.1   // at 12:02

@ruflin's idea is to return a timestamp and the total CPU usage since the Beat has been started. It would look like this:

timestamp=12:00 beat.cpu.usage=0.1
timestamp=12:01 beat.cpu.usage=0.2
timestamp=12:02 beat.cpu.usage=0.3 

@ruflin please, correct me if I misunderstood your idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. So on the UI side there would be fully flexibility if it should be 1,5,15 or 1,10,30 minute average and we report only one value each time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A potential problem came to my mind. It might be confusing to users to see that the internal metrics are reported differently from metricbeat's format. For the sake of consistency, I would stick with the current solution. (Even if it's not as flexible as your proposal is.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So my only concern with this approach is that it's different from what we're currently doing.

In the short term, can we have both? The other parts of the stack also report it in the not-rolled-up format, so we it may be confusing to present a different, albeit better approach to showing the utilization. However, longer term, I would love to see us move toward the more flexible approach, which is only possible if we already have the data (and it seems cheap to have both).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having both SGTM

@@ -71,6 +71,9 @@ var gauges = map[string]bool{
"beat.memstats.memory_total": true,
"beat.memstats.memory_alloc": true,
"beat.memstats.gc_next": true,
"beat.cpu.usage": true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tsg At one point we should review these metrics to align them with metricbeat so it would be system.cpu.* .

@ruflin
Copy link
Contributor

ruflin commented Nov 9, 2017

@tsg Yes, we should review the naming of all metrics. So my comment inline.

numCores = runtime.NumCPU()
lastSample = cpuSample{
time: time.Now(),
procTimes: sigar.ProcTime{},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I would omit this line because it doesn't do any more initialization than the zero value you get by default. (Same for line 77.)

)

func init() {
pid := os.Getpid()
err := lastSample.procTimes.Get(pid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get() is not implemented for all platforms so by adding this you will be limiting the platforms for which Beats can be compiled. This should probably be stubbed in gosigar, but that is currently not the case so you'll need to only use it on the platforms where it's available (linux, windows, openbsd, darwin).

pid := os.Getpid()
err := lastSample.procTimes.Get(pid)
if err != nil {
panic(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should handle the error rather than panic'ing. I would reserve the use of panics to cases that are completely unrecoverable or are programmer errors that are covered by unit tests (like hitting the default case in a switch statement that is supposed to cover all enum values). Panics on init can be really hard to debug especially on Windows.

monitoring.ReportFloat(V, "usage.normalized", normalizedCPU)
}

func getCPUUsage() (float64, float64) {
Copy link
Member

@andrewkroh andrewkroh Nov 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest getProcessCPUUsage to make it clear that it's CPU usage for this process only vs overall CPU usage. It would be nice to have a godoc statement explaining what the return values are since there are two of them (and what their ranges are like [0, 1], [0, 100], [0, 100 * num_cpus], etc.).


dTime := sample.time.Sub(lastSample.time)
dMilli := dTime / time.Millisecond
dCPU := int64(sample.procTimes.Total - lastSample.procTimes.Total)
Copy link
Member

@andrewkroh andrewkroh Nov 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this value be negative? (like if the counter rolls over) If we just send the raw counter value this problem goes to someone else :).

procTimes: sigar.ProcTime{},
}

if err := sample.procTimes.Get(pid); err != nil {
Copy link
Member

@andrewkroh andrewkroh Nov 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to return the error and have the caller handle it. The caller might want to log it so we have some way to debug things when a user tells us they are only seeing 0 for CPU usage.

metrics := monitoring.Default.NewRegistry("beat")

monitoring.NewFunc(metrics, "memstats", reportMemStats, monitoring.Report)
monitoring.NewFunc(metrics, "cpu", reportCPU, monitoring.Report)
Copy link
Member

@andrewkroh andrewkroh Nov 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the Beat is running on a platform where we have no support for getting process CPU data maybe we should not register this metric?

@tsullivan I assume in the UI it would be best to tell the user (like n/a) that we don't have CPU usage data rather than showing 0. Would not sending the metric be the way to go?

@kvch kvch force-pushed the feature/libbeat/cpu-usage-metric branch from 52b5634 to e14fcb7 Compare November 14, 2017 16:46
@@ -1,16 +1,40 @@
// +build darwin freebsd linux openbsd windows

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

package comment should be of the form "Package instance ..."

@kvch
Copy link
Contributor Author

kvch commented Nov 15, 2017

I opened a PR in gosigar, so crosscompile will be fixed: elastic/gosigar#83

@kvch kvch force-pushed the feature/libbeat/cpu-usage-metric branch from 7bdc2fd to 25cd8db Compare November 28, 2017 16:19
sample *sigar.LoadAverage
}

type LoadAverages struct {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported type LoadAverages should have comment or be unexported

return &LoadMetrics{load}, nil
}

type LoadMetrics struct {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported type LoadMetrics should have comment or be unexported


// CPU Core Monitor

// CPUCoreMonitor is used to monitor the usage of individual CPU cores.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on exported type CPUCoreMetrics should be of the form "CPUCoreMetrics ..." (with optional leading article)

}
}

func (m *CPUMetrics) Ticks() CPUTicks {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported method CPUMetrics.Ticks should have comment or be unexported

Steal uint64
}

type CPUMetrics struct {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported type CPUMetrics should have comment or be unexported

@@ -292,7 +293,7 @@ func (procStats *ProcStats) getProcessEvent(process *Process) common.MapStr {
// time multiplied by the number of cores as the total amount of CPU time
// available between samples. This could result in incorrect percentages if the
// wall-clock is adjusted (prior to Go 1.9) or the machine is suspended.
func GetProcCpuPercentage(s0, s1 *Process) (normalizedPct, pct float64) {
func GetProcCpuPercentage(s0, s1 *Process) (normalizedPct, pct, totalPct float64) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

func GetProcCpuPercentage should be GetProcCPUPercentage

cpuTotalPct float64
cpuTotalPctNorm float64
}

type ProcStats struct {
type Stats struct {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported type Stats should have comment or be unexported

@@ -0,0 +1,9 @@
package process

// includeTopConfig is the configuration for the "top N processes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on exported type IncludeTopConfig should be of the form "IncludeTopConfig ..." (with optional leading article)

return 0.0, 0.0, 0.0, fmt.Errorf("error converting value of normalized CPU usage")
}

totalCpuUsage, ok := iTotalCpuUsage.(float64)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var totalCpuUsage should be totalCPUUsage

return 0.0, 0.0, 0.0, fmt.Errorf("error getting normalized CPU percentage: %v", err)
}

iTotalCpuUsage, err := beatState.GetValue("cpu.total.total_since_start")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var iTotalCpuUsage should be iTotalCPUUsage

Total float64
}

// CPUPercentages stores all CPU values in number of tick collected by a Beat.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on exported type CPUTicks should be of the form "CPUTicks ..." (with optional leading article)

@@ -441,12 +450,12 @@ func isProcessInSlice(processes []Process, proc *Process) bool {

// isWhitelistedEnvVar returns true if the given variable name is a match for
// the whitelist. If the whitelist is empty it returns false.
func (p ProcStats) isWhitelistedEnvVar(varName string) bool {
if len(p.envRegexps) == 0 {
func (s Stats) isWhitelistedEnvVar(varName string) bool {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

receiver name s should be consistent with previous receiver name procStats for Stats

V.OnRegistryStart()
defer V.OnRegistryFinished()

cpuUsage, cpuUsageNorm, totalCpuUsage, err := getCPUPercentages()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var totalCpuUsage should be totalCPUUsage

@@ -441,12 +451,12 @@ func isProcessInSlice(processes []Process, proc *Process) bool {

// isWhitelistedEnvVar returns true if the given variable name is a match for
// the whitelist. If the whitelist is empty it returns false.
func (p ProcStats) isWhitelistedEnvVar(varName string) bool {
if len(p.envRegexps) == 0 {
func (ps Stats) isWhitelistedEnvVar(varName string) bool {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

receiver name ps should be consistent with previous receiver name procStats for Stats

@kvch kvch force-pushed the feature/libbeat/cpu-usage-metric branch from a0830ef to ceebe3b Compare November 30, 2017 19:04
@@ -129,6 +129,13 @@
The percentage of CPU time spent in non-idle state.


# Total
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also needed in process as it is added for each process too.

@kvch kvch force-pushed the feature/libbeat/cpu-usage-metric branch from 986a14e to 6bbf3ff Compare December 1, 2017 10:28
@@ -14,7 +11,7 @@ type MemStat struct {
ActualUsedPercent float64 `json:"actual_used_p"`
}

func GetMemory() (*MemStat, error) {
func Get() (*MemStat, error) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported function Get should have comment or be unexported

@@ -14,7 +11,7 @@ type MemStat struct {
ActualUsedPercent float64 `json:"actual_used_p"`
}

func GetMemory() (*MemStat, error) {
func Get() (*MemStat, error) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported function Get should have comment or be unexported

@ruflin
Copy link
Contributor

ruflin commented Dec 6, 2017

@kvch Please ping me when an other round of review is needed here. Let's get this in :-)

@kvch kvch force-pushed the feature/libbeat/cpu-usage-metric branch from 2c9e5d0 to e4fc705 Compare December 12, 2017 18:14
@kvch kvch force-pushed the feature/libbeat/cpu-usage-metric branch 3 times, most recently from 5a93f7e to 2efe38d Compare January 4, 2018 11:45
* Add both percentage and normalized percentage of CPU.
* Move Round function to common.
@kvch kvch force-pushed the feature/libbeat/cpu-usage-metric branch from 2efe38d to d9d63a7 Compare January 4, 2018 12:10
@kvch
Copy link
Contributor Author

kvch commented Jan 4, 2018

@ruflin This PR is ready for a new review.
I removed the crosscompile jobs from Travis and opened a meta issue: #5984

.travis.yml Outdated
env: TARGETS="-C libbeat crosscompile"
go: $GO_VERSION
stage: test
# - os: linux
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe leave a comment here that links to #5984

@kvch kvch force-pushed the feature/libbeat/cpu-usage-metric branch from d9d63a7 to 0f0f63b Compare January 4, 2018 15:37
@kvch kvch removed the in progress Pull request is currently in progress. label Jan 4, 2018
Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to merge this PR and directly open a follow up PR trying to fix the crosscompile part to reenable it. Also keeping and eye on the packaging to make sure we didn't break it.

@@ -1,20 +1,19 @@
// +build darwin freebsd linux openbsd windows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason this was removed?

@ruflin ruflin merged commit e18cff5 into elastic:master Jan 5, 2018
@ruflin ruflin mentioned this pull request Jan 5, 2018
ruflin added a commit to ruflin/beats that referenced this pull request Jan 5, 2018
In elastic#5545 the crosscompile builds were disable because they didn't work with the changes. This PR reenables the crosscompiling by making the following changes:

* Introduce build flags for the metrics. Currently only build on darwin, linux, windows and exclude if cgo is not available
* Readd build flags to cpu and memory. They were removed in the previous PR but not sure why.
* Uncomment crosscompile builds for travis

This only brings back the previous behaviour. Additional changes should be made to support monitoring also on missing platforms.

See elastic#5984
kvch pushed a commit that referenced this pull request Jan 8, 2018
In #5545 the crosscompile builds were disable because they didn't work with the changes. This PR reenables the crosscompiling by making the following changes:

* Introduce build flags for the metrics. Currently only build on darwin, linux, windows and exclude if cgo is not available
* Readd build flags to cpu and memory. They were removed in the previous PR but not sure why.
* Uncomment crosscompile builds for travis

This only brings back the previous behaviour. Additional changes should be made to support monitoring also on missing platforms.

See #5984
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants