Influx-Capacitor service self monitoring #57

marcelopetersen · 2016-02-26T14:18:42Z

In my monitoring system, I normally check the state of a service to ensure that it is running.
When monitoring the influx-capacitor service, even if it is running we have no guarantee that it is collecting data. For example, if the connection to the database is unavailable:

Log Name: Application
Source: Tharga.Toolkit.Console
Date: 26/02/2016 10:06:30
Event ID: 0
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: ComputerName
Description:
Could not establish a connection to the database.

We could have an option to configure a URL to which the service sends its current state at a fixed interval, along with any error messages such as database unavailable. It would work as heartbeat monitoring.

Something like this:

<Influx-Capacitor>
  <HealthCheck Type="Nagios|OpsMgr|Custom" Enabled="true" SecondsInterval="60" SendErrorMessages="true">
    <MachineName>MyComputer</MachineName>
    <Url>http://mymonitor.com/AgentStatus</Url>
  </HealthCheck>
</Influx-Capacitor>
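
To make the proposal concrete, here is a minimal sketch of how a client could read a `HealthCheck` element like the one above and build a heartbeat message from it. It is written in Python for illustration only, not in the service's own code. The function names (`load_health_check`, `build_heartbeat`) and the JSON payload fields (`machine`, `timestamp`, `status`, `errors`) are assumptions, not part of Influx-Capacitor, and `Type` is set to a single value (`Custom`) rather than the list of alternatives.

```python
import json
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Example configuration following the proposed layout (hypothetical, not an existing feature).
CONFIG = """
<Influx-Capacitor>
  <HealthCheck Type="Custom" Enabled="true" SecondsInterval="60" SendErrorMessages="true">
    <MachineName>MyComputer</MachineName>
    <Url>http://mymonitor.com/AgentStatus</Url>
  </HealthCheck>
</Influx-Capacitor>
"""


def load_health_check(xml_text):
    """Parse the proposed HealthCheck element into a plain dict, or None if absent."""
    root = ET.fromstring(xml_text.strip())
    node = root.find("HealthCheck")
    if node is None:
        return None
    return {
        "type": node.get("Type"),
        "enabled": node.get("Enabled", "false").lower() == "true",
        "interval_seconds": int(node.get("SecondsInterval", "60")),
        "send_errors": node.get("SendErrorMessages", "false").lower() == "true",
        "machine_name": node.findtext("MachineName"),
        "url": node.findtext("Url"),
    }


def build_heartbeat(settings, errors, now=None):
    """Build the JSON body of one heartbeat; the field names are assumptions."""
    now = now or datetime.now(timezone.utc)
    payload = {
        "machine": settings["machine_name"],
        "timestamp": now.isoformat(),
        "status": "error" if errors else "ok",
    }
    # Error text is attached only when SendErrorMessages is enabled.
    if settings["send_errors"] and errors:
        payload["errors"] = list(errors)
    return json.dumps(payload)
```

The service would then post this body to the configured `Url` every `SecondsInterval` seconds; the transport is left out so the sketch stays independent of any particular monitoring system.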

poxet · 2016-02-27T13:56:39Z

You mean like a heartbeat?

I have also started to add log4net so that it would be possible to debug issues.

marcelopetersen · 2016-02-27T14:08:59Z

Yes, like a heartbeat. Log4net is useful for local debugging, but how do we identify that the service is failing to send data to the database? If the service is up but cannot access the database, we have no way to detect this error. Currently, we must log on to the machine and search the Event Viewer.

marcelopetersen · 2016-02-29T11:08:04Z

What do you think about having the service stop when it cannot access the database, instead of adding monitoring? It would be easier to monitor because we would just need to check the service state.
My concern is a machine that cannot send data, and how to identify that problem (we have a lot of machines sending data).

poxet · 2016-02-29T12:06:48Z

I think a heartbeat with information about the latest issues is a good idea. That would make it possible to monitor several machines in one place.

nathanwebb · 2016-02-29T23:35:46Z

This would tie in nicely with #29. The heartbeat could be as simple as a timestamp with status, sent to a central database.

Some scenarios:

If the database is down - don't worry about the agents, just fix the database ;)
If the agent can't reach the database: Since the central database would have the configuration for that agent as well, it would be easy to calculate when the next heartbeat should be sent. If the agent misses a heartbeat (+ some leeway), then you have an incident.
If some data is sent, but not all: Again, since the database contains the configuration, it should be able to see what is supposed to be sent. If everything looks OK but only some of the data is sent, this could still be identified.
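
The last two scenarios can be sketched as a deadline check and a set difference. This is a rough Python illustration under assumptions: the function names, the shape of the `agents` mapping, and the 15-second default leeway are all hypothetical, and in practice the last-seen times and expected series would come from the central database.

```python
from datetime import datetime, timedelta, timezone


def missed_heartbeat(last_seen, interval_seconds, leeway_seconds, now):
    """Return True when the next expected heartbeat (plus leeway) is overdue."""
    deadline = last_seen + timedelta(seconds=interval_seconds + leeway_seconds)
    return now > deadline


def find_incidents(agents, now, leeway_seconds=15):
    """List agents that missed a heartbeat.

    agents maps agent name -> (last heartbeat time, configured interval in seconds).
    """
    return sorted(
        name
        for name, (last_seen, interval) in agents.items()
        if missed_heartbeat(last_seen, interval, leeway_seconds, now)
    )


def missing_series(expected, received):
    """Series the configuration says should arrive but that did not."""
    return sorted(set(expected) - set(received))
```

For example, with a 60-second interval and 15 seconds of leeway, an agent last seen 200 seconds ago would be reported as an incident, while one last seen 30 seconds ago would not.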

poxet added the enhancement label Feb 27, 2016
