
Topbeat: can't reliably retrieve proc.cpu.total_p value from ES #1009

Closed
nktl opened this issue Feb 19, 2016 · 10 comments

nktl commented Feb 19, 2016

OK, I've spent almost two days dealing with this problem and I'm going a bit crazy now. Possibly missing something silly and obvious.

Setup:

  • Vanilla, standalone ES box running official Elasticsearch 2.2 from RPM package on RHEL 6
  • Topbeat 1.2 binary compiled from master (I need 'cmdline' functionality) running on various servers and pushing data to ES
  • applied 'topbeat.template.json.txt' template to ES and verified all fields are of the correct type with 'not_analyzed' property set
  • all metrics are collected successfully every 10 seconds, and I can see in Kibana that all the data is correct and valid

Problem:

Trying to graph per-process CPU usage per host via Grafana. It looks like there is some float rounding issue for the CPU-specific "_p" metrics. Using a direct query to ES, the retrieved value is always either 0.0 or 1.0, where it should be a float in that range (1.0 being 100%). The query I use was initially constructed by Grafana and looks as follows (proc.pid is unique in this scenario):

curl -XPOST 'http://localhost:9200/_search?pretty' -d '
{"size":0,"query":{"filtered":{"query":{"query_string":{"analyze_wildcard":true,"query":"proc.pid:51789"}}}},"aggs":{"2":{"date_histogram":{"interval":"10s","field":"@timestamp","min_doc_count":0},"aggs":{"1":{"avg":{"field":"proc.cpu.total_p"}}}}}}'

This returns a bunch of buckets with a value of 0.0 (and occasionally 1.0, when CPU usage is >= 100%), like:

   {
     "key_as_string" : "2016-02-19T17:54:30.000Z",
     "key" : 1455904470000,
     "doc_count" : 1,
     "1" : {
       "value" : 0.0
     }
   }

The problem also exists for host-level CPU % metric like 'cpu.system_p'. It does NOT occur for RAM-related % metrics, like proc.mem.rss_p, for instance the following query works fine:

curl -XPOST 'http://localhost:9200/_search?pretty' -d '
{"size":0,"query":{"filtered":{"query":{"query_string":{"analyze_wildcard":true,"query":"proc.pid:51789"}}}},"aggs":{"2":{"date_histogram":{"interval":"10s","field":"@timestamp","min_doc_count":0},"aggs":{"1":{"avg":{"field":"proc.mem.rss_p"}}}}}}'

The result is:

 {
   "key_as_string" : "2016-02-19T18:07:10.000Z",
   "key" : 1455905230000,
   "doc_count" : 1,
   "1" : {
     "value" : 0.019999999552965164
   }
 }
The problem also does NOT exist for any non-% metrics, like 'proc.cpu.total' or 'cpu.system': the retrieved data is valid (although not terribly useful).

As mentioned, the data for all '_p' values visible in Kibana seems to be correct for all cases, for instance:

    "proc": {
      "cmdline": "xxx",
      "cpu": {
        "start_time": "03:12",
        "system": 68580,
        "total": 12651370,
        "total_p": 0.2411,
        "user": 12582790
      },
      "mem": {
        "rss": 16055091200,
        "rss_p": 0.02,
        "share": 1384448,
        "size": 19614089216
      },
      "name": "xxx",
      "pid": 51789,
      "ppid": 1,
      "state": "sleeping",
      "username": "user"
    }

Additionally:

  • Can't see any difference in mapping between, for instance, proc.mem.rss_p (which works) and proc.cpu.total_p (which does not). The values are also very similar in scale, and the mapping is identical (float, not_analyzed), for instance:
     "proc": {
       "properties": {
         "cpu": {
           "properties": {
             "user_p": {
               "type": "float"
             }
           }
         },
         "mem": {
           "properties": {
             "rss_p": {
               "type": "float"
             }
           }
         }
       }
     },
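For what it's worth, the declared field types can be compared programmatically from the mapping JSON. A minimal sketch, assuming the fragment above (field names taken from this thread):

```python
import json

# Mapping fragment as posted above (assumed shape; total_p would look like user_p)
mapping_json = """
{
  "proc": {
    "properties": {
      "cpu": {"properties": {"user_p": {"type": "float"}}},
      "mem": {"properties": {"rss_p": {"type": "float"}}}
    }
  }
}
"""

props = json.loads(mapping_json)["proc"]["properties"]
cpu_type = props["cpu"]["properties"]["user_p"]["type"]
mem_type = props["mem"]["properties"]["rss_p"]["type"]
print(cpu_type == mem_type)  # True: the declared types are identical
```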

Troubleshooting:

  • Tried modifying my query in various ways (syntax, aggregation types, using a bool query) - no luck
  • Additionally tested with Elasticsearch 1.7.5 and a 3.0-SNAPSHOT compiled from master - in both cases the problem still exists, even after changing the query syntax to conform to the 3.0 specification.
  • Changed the mapping for the affected fields from float to double in a desperate attempt to increase precision - no difference

I have absolutely no clue what is going on here, as it seems logical that querying 'proc.mem.rss_p' should produce the same kind of behavior as 'proc.cpu.total_p' - but it does not... It is possible I am missing something obvious (some problem with Grafana query?) and would be very grateful for any advice.

@tsg tsg added Topbeat discuss Issue needs further discussion. labels Feb 19, 2016
Contributor

tsg commented Feb 19, 2016

To me it sounds a bit like a mapping issue. Can you check whether there's any difference in the mapping between the CPU and mem fields? Simply querying the index name will print the mapping.

You might also try to see if you can reproduce the same issue with Kibana.

@tsg tsg added question and removed discuss Issue needs further discussion. labels Feb 19, 2016
Author

nktl commented Feb 20, 2016

The mapping is identical, I am afraid.
Can you replicate the problem using ES queries I posted or is the issue somehow specific to my environment?

Contributor

tsg commented Feb 20, 2016

I'll try to reproduce it, but I'm currently traveling (along with the whole team), so I'm not sure when I'll get to it.


ku1ik commented Feb 26, 2016

I'm having the same issue. I'm running a very similar setup to @nktl's (the only difference is that I'm on CentOS 7, which is not a big difference).

Author

nktl commented Mar 3, 2016

Hey, any luck with this, guys? It looks like the problem can in fact be replicated, as per @sickill's comment.

@monicasarbu
Contributor

@nktl @sickill It might be a problem with calculating the CPU usage per process. Our implementation is similar to the psutil library, so you can check whether the results returned by this library are similar to what you expect. Are you comparing with the results reported by the top command?
After you install Python, you can execute the following commands in a Python terminal to print the CPU usage percentage 30 times (simulating the top command):

import psutil
for x in range(30):
    print(psutil.cpu_percent(interval=10))

Do you get approximately the same CPU usage values with topbeat? Does the top command return a different range of values?

Author

nktl commented Mar 7, 2016

Thanks for chiming in, @monicasarbu.
I ran the piece of python you provided and it reports values consistent with those from top.

I don't think the problem is with the data collection itself, as I can see correct data in Kibana for the per-process CPU metrics collected by topbeat. The issue is that whenever I try to retrieve those values from ES using the queries from my first post, the values appear to get cast to 'int' for some crazy reason - so any decimal precision is lost (you essentially get either 0 or 1 back).

@monicasarbu
Contributor

@nktl Thank you for the clarifications. The problem might be that you have a mixture of data in Elasticsearch: some inserted before applying the template (mapped as int) and some after you loaded the template (mapped as float). In this case Elasticsearch sometimes converts the percentages (float) to the default mapping, which is int.
I would recommend deleting your old data in Elasticsearch, so that you only have data inserted after loading the template. We are working on a solution to load the template automatically at Beat startup, which will fix these issues: #639.
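The 0/1 values reported above are consistent with truncation into an integer-mapped field. A small sketch of that effect (sample values are hypothetical, chosen for illustration):

```python
# Sketch of the suspected coercion: float percentages indexed into a field
# that was first mapped as an integer type lose their decimal part, so an
# avg aggregation only ever sees 0s and 1s.
samples = [0.2411, 0.87, 1.3, 0.02]      # hypothetical topbeat "_p" values
coerced = [int(v) for v in samples]      # what an int-mapped field would store
print(coerced)                           # [0, 0, 1, 0]
print(sum(samples) / len(samples))       # true average: ~0.6078
print(sum(coerced) / len(coerced))       # average after coercion: 0.25
```

This matches the symptom in the original report: values below 1.0 collapse to 0, and only CPU usage at or above 100% survives as 1.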

Author

nktl commented Mar 9, 2016

Thanks, it looks like this is exactly what was happening. I had topbeat instances running while clearing the index and applying the template, so it is very likely some of the data got inserted in int format before the template took effect. The unexpected part is that ES converts all values to int in direct queries, based on just a few initial values, while at the same time displaying proper 'float' values in Kibana.

It looks like stopping all the instances, doing a full cleanup of the index, and applying the template fixed the issue - my queries work as expected now. Many thanks for your assistance with this.

Contributor

tsg commented Apr 29, 2016

Issue seems resolved, thanks @monicasarbu.

@tsg tsg closed this as completed Apr 29, 2016