pingmachine

Pingmachine - Smokeping-like pinging framework

Introduction

Monitoring components often need to do latency/loss measurements by regularly sending probes, typically ICMP ECHO (ping). Pingmachine was born out of the idea to have a common framework that takes care of this "pinging". That is useful because pinging is more complicated than it seems: it needs to be done efficiently (don't fork too much) and asynchronously (pinging takes time), and the data needs to be stored in a way that allows the most useful statistics to be extracted.

The following were requirements when designing pingmachine (which might help understand the reason for its architecture):

  • Configurable targets
  • Extensible probing methods (not only ping)
  • High performance (should handle 10'000 targets)
  • Storage of the measurement distribution plus the median value when multiple probes are sent to the same target in each iteration.
  • Consolidation using maximum, minimum and average functions.
  • Management of old target data (the list of targets changes rather frequently!)

Smokeping vs. Pingmachine

After initial tests with Smokeping, we decided to write our own framework for starting "ping-jobs" and collecting measurements. Our script is much simpler than Smokeping, which is a complete monitoring solution, including a Web-site to browse results. Nonetheless, "Pingmachine" is heavily inspired by Smokeping. The main differences are:

  • Smokeping is designed as a complete application, including the frontend. This shows in many places, for example in what must be configured for it to run at all (URL address, etc.)
  • The Smokeping configuration is not dynamic. If it changes, then Smokeping needs to be reloaded. This could be a problem for us if we want to use it for highly dynamic applications (sang?)
  • Generation of graphs doesn't work out of the box, because the graphs are generated by the CGI when displaying a page. We can't simply ask Smokeping to generate a particular graph.

Interface for Applications

One of the things pingmachine is optimized for is a highly dynamic configuration. What needs to be measured could change every minute, and pingmachine should be able to handle it.

Order Interface

Pingmachine uses a directory where configuration "snippets" are put by whoever wants something to be done. These configuration snippets are called orders.

Pingmachine monitors the /orders directory, immediately picks up new and modified orders, and does what is necessary to start the measurements. If the order file is deleted, pingmachine automatically stops that monitoring. Please do not put the order file directly into the orders directory (see Old Order Interface), but in a subdirectory of your choice, optionally structured with further subdirectories as you like.

The output produced by the monitoring work will be put in a separate directory (/output). It will reflect the structure of the input folder:

                .---> /orders/... >---,
               /                       \
Application---.                         .--- pingmachine <---> fping
               \                       /
                `---< /output/... <---'

Orders Specification

An order is typically one target IP address that needs to be monitored. If an application needs to monitor an IP address, it just writes the corresponding order file in the "/orders" directory.

Orders also need to be periodically refreshed (at least once an hour) to make sure that pingmachine continues with the measurements. This mechanism ensures that a stale configuration does not cause an IP address to be monitored indefinitely.

An example order could be (in YAML):

 user: tmon
 task: tun_eth0_4061
 step: 300
 pings: 20
 probe: fping
 fping:
     host: 62.179.116.250

To additionally forward the measurements to telegraf, the order can optionally be extended with measurement name and tags:

 measurement_name: tunnel
 tags:
     tunnel_id: 12458
     remote_host: 5292
     interface: eth2
     remote_interface: eth2
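
As an illustration of how an application might produce such an order, here is a minimal Python sketch. The function names (render_order, write_order, refresh_order) are hypothetical and not part of pingmachine. The YAML is emitted by plain string formatting without quoting, which only suits simple values like the ones in the examples above. The README does not say how a refresh is detected, so updating the modification time is an assumption.

```python
import os


def render_order(user, task, step, pings, probe, probe_config,
                 measurement_name=None, tags=None):
    """Render an order as YAML text, using the keys from the examples above."""
    lines = [
        f"user: {user}",
        f"task: {task}",
        f"step: {step}",
        f"pings: {pings}",
        f"probe: {probe}",
        f"{probe}:",
    ]
    lines += [f"    {key}: {value}" for key, value in probe_config.items()]
    # Optional telegraf extension: measurement name and tags.
    if measurement_name is not None:
        lines.append(f"measurement_name: {measurement_name}")
    if tags:
        lines.append("tags:")
        lines += [f"    {key}: {value}" for key, value in tags.items()]
    return "\n".join(lines) + "\n"


def write_order(orders_dir, app, name, text):
    """Write the order to <orders_dir>/<app>/<name> and return its path."""
    app_dir = os.path.join(orders_dir, app)
    os.makedirs(app_dir, exist_ok=True)
    path = os.path.join(app_dir, name)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path


def refresh_order(path):
    """Assumption: bumping the modification time counts as a refresh."""
    os.utime(path, None)
```

An application would call write_order once per target and then call refresh_order from a timer that fires more often than once an hour.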

The relative path from the orders/ directory is the "order id". It is chosen by the client, which must make sure that it is unique across all applications and all targets. We suggest using one subdirectory per application and an application-unique name for the order file.

Be aware that file systems limit the number of links an inode can have, which caps the number of subdirectories in a single directory. Paths of arbitrary depth in orders/ are therefore supported.

The file-system tree could look as follows:

 /var/lib/pingmachine/
 |---- orders/
 | |---- app1/
 | | |---- target1
 | | `---- target2
 | |----...

Note that orders are to be considered dynamic configuration. The "users" (tmon, etc.) are usually high-level programs which just install orders as needed. Pingmachine then does the measurements as instructed. In other words: the whole /var/lib/pingmachine directory can be deleted and recreated (with the loss of measured data, however).

Old Order Interface

To make this possible, pingmachine uses directories where configuration "snippets" are put by whoever wants something to be done. These configuration snippets are called orders and telegraf orders. The configuration of an order or telegraf order is not allowed to change. Orders and telegraf orders can only be added or removed.

Pingmachine monitors these /orders and /telegraf directories, immediately picks up new orders and telegraf orders, and does what is necessary to start with the measurements. If the order file is deleted, pingmachine will also automatically stop with that monitoring.

The output produced by the monitoring work will be put in a separate directory (/output), which can then be used by whoever gave the order to fetch the data:

                   .----> /orders >----,
                  /       /telegraf     \
  Application----.                       .---- pingmachine <----> fping
                  \                     /
                   `----< /output <----'

Orders and Telegraf Orders Specification

An order is typically one target IP address that needs to be monitored. If an application needs to monitor an IP address, it just writes the corresponding order file in the "/orders" directory (more about the exact directory structure later). It additionally writes a corresponding telegraf file in the "telegraf" directory if the IP address belongs to a tunnel.

Orders and telegraf orders also need to be periodically refreshed (at least once an hour) to make sure that pingmachine continues with the measurements. This mechanism ensures that a stale configuration does not cause an IP address to be monitored indefinitely.

An example order could be (in YAML):

 user: tmon
 task: tun_eth0_4061
 step: 300
 pings: 20
 probe: fping
 fping:
     host: 62.179.116.250

An example telegraf order could be (in YAML):

 measurement_name: tunnel
 tags:
     tunnel_id: 12458
     remote_host: 5292
     interface: eth2
     remote_interface: eth2

The file name (the order "id") is determined by calculating the md5 checksum on the file contents. This makes sure that different orders have different identifiers and also that, if the same order reappears, it is going to have the same id. The file name of the existing telegraf files is the same as the one of the corresponding order file.

The file-system tree could look as follows:

 /var/lib/pingmachine/
 |---- orders/
 | |---- 6dd803dc5d29b72564467de7ddbfc695
 | `---- cd7d89acdba05cef56184db4a7b044ea
 |---- telegraf/
 | |---- 6dd803dc5d29b72564467de7ddbfc695
 | `---- cd7d89acdba05cef56184db4a7b044ea
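
Deriving the file name from the order contents, as described above, could look like this in Python. The function name legacy_order_id is hypothetical, and the sketch assumes the checksum is taken over the exact bytes written to disk.

```python
import hashlib


def legacy_order_id(content: bytes) -> str:
    """Return the md5 hex digest of the order file contents."""
    return hashlib.md5(content).hexdigest()
```

Because the id depends only on the contents, writing the same order again yields the same file name, while any change to the order yields a different one.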

Note that orders and telegraf orders are to be considered dynamic configuration. The "users" (tmon, etc.) are usually high-level programs which just install orders as needed. Pingmachine then does the measurements as instructed. In other words: the whole /var/lib/pingmachine directory can be deleted and recreated (with the loss of measured data, however).

Output

The generated RRD file is put in an order-specific directory created under the output tree. Each directory contains one RRD file: main.rrd. The definition of the RRD file is the same for all probes, and its creation is handled by the main script. This allows us to interchange probe types easily and to offload some work from the probe modules.

 |---- output/
 | |---- app1/
 | | `---- target1/
 | |   |---- main.rrd
 | |   `---- last_result
 | |---- 6dd803dc5d29b72564467de7ddbfc695/
 | | |---- main.rrd
 | | `---- last_result
 | `---- ...

The output directory also contains a last_result file, with just the latest result of the pinging. It is meant to be used by programs that only need information about the latest ping job. The format of the file is as follows:

 time: 1310116500
 updated: 1310116524
 step: 20
 pings: 1
 loss: 0
 min: 5.700000e-04
 median: 5.700000e-04
 max: 5.700000e-04
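
A program that only needs the latest result could read this file with a small parser. The following Python sketch is based solely on the sample above. The function name parse_last_result is hypothetical, and values that are not numeric are kept as strings, since the README does not describe how missing measurements are written.

```python
def _convert(value):
    """Convert a value to int or float if possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def parse_last_result(text):
    """Parse the 'key: value' lines of a last_result file into a dict."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        result[key.strip()] = _convert(value.strip())
    return result
```

Applied to the sample, this gives integers for time, updated, step, pings and loss, and floats for min, median and max.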

Archiving

When an order is explicitly deleted or has timed out, the corresponding output directory is moved to the archive directory:

 |---- archive/
 | |---- app1
 | | `---- target2
 | |   `---- main.rrd
 | |---- 2b45d6a19d2c3684767440fcb2f0b0c9/
 | | `---- main.rrd
 | |---- ...

As soon as the "order" file of an archived order is put into the orders directory again, pingmachine moves the output data back into place. It should not be possible to have data in both output and archive for the same order.
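
A client that wants to read the data of an order regardless of whether it is active or archived could check both trees. This is a minimal Python sketch. The function name find_order_data is hypothetical, and base_dir stands for the pingmachine directory (for example /var/lib/pingmachine).

```python
import os


def find_order_data(base_dir, order_id):
    """Return (area, directory) for an order, looking in output, then archive.

    Returns (None, None) if no data directory exists for the order.
    """
    for area in ("output", "archive"):
        candidate = os.path.join(base_dir, area, order_id)
        if os.path.isdir(candidate):
            return area, candidate
    return None, None
```

Checking output first means that data of an active order is preferred, although both should never exist at the same time.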

Supported Probes

fping

fping:
    host: 213.156.230.57
    interface: eth0
    source_ip: 10.0.0.12

Note that the source interface and IP are optional. The fping utility must be installed on the system.

httping

httping:
    url: http://www.example.com
    user_agent: pingmachine
    proxy: http://10.0.0.24:8080
    http_codes_as_failure: 403,407,503

Note that the user_agent, the proxy and the http_codes_as_failure configuration are optional. The httping utility must be installed on the system.

SCION SCMP and pingpong tool

sping:
    host: 213.156.230.57
    interface: eth0
    source_ip: 10.0.0.12
    flags:


pping:
    host: 213.156.230.57
    interface: eth0
    source_ip: 10.0.0.12
    flags:

Sent metrics

For every telegraf file, both the metrics gathered by pingmachine and the metrics provided in the telegraf file are sent.
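
The README does not document the exact data that is sent. Since InfluxDB::LineProtocol is among the required modules, the following Python sketch shows how a measurement name, tags and numeric fields combine into one InfluxDB line protocol line. The function name to_line_protocol is hypothetical, and the field names used in the usage note (loss, median) are assumptions borrowed from the last_result file, not confirmed names of what pingmachine sends.

```python
def _escape(text, chars):
    """Backslash-escape each of the given characters in text."""
    text = str(text)
    for char in chars:
        text = text.replace(char, "\\" + char)
    return text


def to_line_protocol(measurement_name, tags, fields, timestamp_ns=None):
    """Build one line: measurement,tag=value,... field=value,... [timestamp]."""
    # Measurement names escape commas and spaces.
    head = _escape(measurement_name, ", ")
    # Tag keys and values escape commas, equals signs and spaces.
    for key in sorted(tags):
        head += "," + _escape(key, ",= ") + "=" + _escape(tags[key], ",= ")
    # All fields are written as floats in this sketch.
    body = ",".join(
        _escape(key, ",= ") + "=" + repr(float(value))
        for key, value in sorted(fields.items())
    )
    line = head + " " + body
    if timestamp_ns is not None:
        line += " " + str(int(timestamp_ns))
    return line
```

With the tags tunnel_id 12458 and interface eth2 and the assumed fields loss 0 and median 0.00057, this produces the line tunnel,interface=eth2,tunnel_id=12458 loss=0.0,median=0.00057.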

Installation

Required Perl modules:

  • AE
  • AnyEvent
  • Log::Any
  • Log::Any::Adapter::Dispatch
  • InfluxDB::LineProtocol
  • IO::Socket
  • Mouse
  • MouseX::NativeTraits
  • RRDs (RRDtool)
  • Term::ANSIColor
  • Try::Tiny
  • YAML::XS

License

See the LICENSE file for usage conditions.