
Topbeat: can't reliably retrieve proc.cpu.total_p value from ES #1009

Closed
nktl opened this issue Feb 19, 2016 · 10 comments

nktl commented Feb 19, 2016

OK, I've spent almost two days dealing with this problem and I'm going a bit crazy now. Possibly missing something silly and obvious.

Setup:

  • Vanilla, standalone ES box running official Elasticsearch 2.2 from RPM package on RHEL 6
  • Topbeat 1.2 binary compiled from master (I need 'cmdline' functionality) running on various servers and pushing data to ES
  • applied 'topbeat.template.json.txt' template to ES and verified all fields are of the correct type with 'not_analyzed' property set
  • all metrics are collected successfully every 10 seconds, and I can see in Kibana that all the data is correct and valid

Problem:

Trying to graph per-process CPU usage per host via Grafana. It looks like there is some float rounding issue for the CPU-specific "_p" metrics. Using a direct query to ES, the retrieved value is always either 0.0 or 1.0, where it should be a float in that range (1.0 being 100%). The query I use was initially constructed by Grafana and looks as follows (proc.pid is unique in this scenario):

curl -XPOST 'http://localhost:9200/_search?pretty' -d '
{"size":0,"query":{"filtered":{"query":{"query_string":{"analyze_wildcard":true,"query":"proc.pid:51789"}}}},"aggs":{"2":{"date_histogram":{"interval":"10s","field":"@timestamp","min_doc_count":0},"aggs":{"1":{"avg":{"field":"proc.cpu.total_p"}}}}}}'

This returns a bunch of buckets with a value of 0.0 (and occasionally 1.0, when CPU usage is >= 100%), like:

   {
     "key_as_string" : "2016-02-19T17:54:30.000Z",
     "key" : 1455904470000,
     "doc_count" : 1,
     "1" : {
       "value" : 0.0
     }
   }

The problem also exists for host-level CPU % metric like 'cpu.system_p'. It does NOT occur for RAM-related % metrics, like proc.mem.rss_p, for instance the following query works fine:

curl -XPOST 'http://localhost:9200/_search?pretty' -d '
{"size":0,"query":{"filtered":{"query":{"query_string":{"analyze_wildcard":true,"query":"proc.pid:51789"}}}},"aggs":{"2":{"date_histogram":{"interval":"10s","field":"@timestamp","min_doc_count":0},"aggs":{"1":{"avg":{"field":"proc.mem.rss_p"}}}}}}'

The result is:

 {
   "key_as_string" : "2016-02-19T18:07:10.000Z",
   "key" : 1455905230000,
   "doc_count" : 1,
   "1" : {
     "value" : 0.019999999552965164
   }
 }
The problem also does NOT exist for any non-% metrics, like 'proc.cpu.total' or 'cpu.system': the retrieved data is valid (although not terribly useful).

As mentioned, the data for all '_p' values visible in Kibana seems to be correct for all cases, for instance:

    "proc": {
      "cmdline": "xxx",
      "cpu": {
        "start_time": "03:12",
        "system": 68580,
        "total": 12651370,
        "total_p": 0.2411,
        "user": 12582790
      },
      "mem": {
        "rss": 16055091200,
        "rss_p": 0.02,
        "share": 1384448,
        "size": 19614089216
      },
      "name": "xxx",
      "pid": 51789,
      "ppid": 1,
      "state": "sleeping",
      "username": "user"
    }

Additionally:

  • Can't see any difference in mapping between, for instance, proc.mem.rss_p (which works) and proc.cpu.total_p (which does not). The values are also very similar in scale, and the mapping is identical (float, not_analyzed), for instance:
     "proc": {
       "properties": {
         "cpu": {
           "properties": {
             "user_p": {
               "type": "float"
             }
           }
         },
         "mem": {
           "properties": {
             "rss_p": {
               "type": "float"
             }
           }
         }
       }
     },
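For what it's worth, the declared field types can be compared programmatically from the mapping JSON. A minimal sketch, assuming the fragment above (field names taken from this thread):

```python
import json

# Mapping fragment as posted above (assumed shape; total_p would look like user_p)
mapping_json = """
{
  "proc": {
    "properties": {
      "cpu": {"properties": {"user_p": {"type": "float"}}},
      "mem": {"properties": {"rss_p": {"type": "float"}}}
    }
  }
}
"""

props = json.loads(mapping_json)["proc"]["properties"]
cpu_type = props["cpu"]["properties"]["user_p"]["type"]
mem_type = props["mem"]["properties"]["rss_p"]["type"]
print(cpu_type == mem_type)  # True: the declared types are identical
```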

Troubleshooting:

  • Tried modifying my query in various ways (syntax, aggregation types, using a bool query) - no luck
  • Additionally tested with Elasticsearch 1.7.5 and a 3.0-SNAPSHOT compiled from master - in both cases the problem still exists, even after changing the query syntax to conform to the 3.0 specification.
  • Changed the mapping for the affected fields from float to double in a desperate attempt to increase precision - no difference

I have absolutely no clue what is going on here, as it seems logical that querying 'proc.mem.rss_p' should produce the same kind of behavior as 'proc.cpu.total_p' - but it does not... It is possible I am missing something obvious (some problem with Grafana query?) and would be very grateful for any advice.

@tsg tsg added Topbeat discuss Issue needs further discussion. labels Feb 19, 2016
Contributor

tsg commented Feb 19, 2016

To me it sounds a bit like a mapping issue. Can you check whether there's any difference in the mapping between the CPU and mem fields? Simply querying the index name will print the mapping.

You might also try to see if you can reproduce the same issue with Kibana.

@tsg tsg added question and removed discuss Issue needs further discussion. labels Feb 19, 2016
Author

nktl commented Feb 20, 2016

The mapping is identical, I am afraid.
Can you replicate the problem using ES queries I posted or is the issue somehow specific to my environment?

Contributor

tsg commented Feb 20, 2016

I'll try to reproduce it, but I'm currently traveling (along with the whole team), so I'm not sure when I'll get to it.


ku1ik commented Feb 26, 2016

I'm having the same issue. I'm running a very similar setup to @nktl's (the only difference is that I'm on CentOS 7, which is not a big difference).

Author

nktl commented Mar 3, 2016

Hey, any luck with this, guys? It looks like the problem can in fact be replicated, as per @sickill's comment.

@monicasarbu
Contributor

@nktl @sickill It might be a problem with calculating the CPU usage per process. Our implementation is similar to the psutil library, so you can check whether the results returned by this library are similar to what you expect. Are you comparing with the results reported by the top command?
After you install Python, you can execute the following commands in a Python terminal to print the CPU usage percentage 30 times (simulating the top command):

import psutil
for x in range(30):
    print(psutil.cpu_percent(interval=10))

Do you get approximately the same CPU usage values with topbeat? Does the top command return a different range of values?

Author

nktl commented Mar 7, 2016

Thanks for chiming in, @monicasarbu.
I ran the piece of python you provided and it reports values consistent with those from top.

I don't think the problem is with the data collection itself, as I can see correct data in Kibana for the per-process CPU metrics collected by topbeat. The issue is that whenever I try to retrieve those values from ES using the queries from my first post, the values appear to get cast to 'int' for some crazy reason - so any decimal precision is lost (you essentially get either 0 or 1 back).

@monicasarbu
Contributor

@nktl Thank you for the clarifications. The problem might be that you have a mixture of data in Elasticsearch: some inserted before applying the template (mapped as int) and some after you loaded the template (mapped as float). In this case Elasticsearch sometimes converts the percentages (float) to the default mapping, which is int.
I would recommend deleting your old data in Elasticsearch, so that you only have data inserted after loading the template. We are working on a solution to load the template automatically at Beat startup, which will fix these issues: #639.
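The 0/1 values reported above are consistent with truncation into an integer-mapped field. A small sketch of that effect (sample values are hypothetical, chosen for illustration):

```python
# Sketch of the suspected coercion: float percentages indexed into a field
# that was first mapped as an integer type lose their decimal part, so an
# avg aggregation only ever sees 0s and 1s.
samples = [0.2411, 0.87, 1.3, 0.02]      # hypothetical topbeat "_p" values
coerced = [int(v) for v in samples]      # what an int-mapped field would store
print(coerced)                           # [0, 0, 1, 0]
print(sum(samples) / len(samples))       # true average: ~0.6078
print(sum(coerced) / len(coerced))       # average after coercion: 0.25
```

This matches the symptom in the original report: values below 1.0 collapse to 0, and only CPU usage at or above 100% survives as 1.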

Author

nktl commented Mar 9, 2016

Thanks, it looks like this is exactly what was happening. I had topbeat instances running while clearing the index and applying the template, so it is very likely some of the data got inserted in int format before the template took effect. The unexpected part is that ES converts all values to int in direct queries, based on just a few initial values, while at the same time displaying proper 'float' values in Kibana.

It looks like stopping all the instances, doing a full cleanup of the index, and applying the template fixed the issue - my queries work as expected now. Many thanks for your assistance with this.

Contributor

tsg commented Apr 29, 2016

Issue seems resolved, thanks @monicasarbu.

@tsg tsg closed this as completed Apr 29, 2016