
cf-metrics-refinery

(Image: sample Grafana dashboard)

cf-metrics-refinery reads Cloud Foundry application metrics and logs, enriches them with application metadata from the Cloud Foundry API, and forwards the result to a sink.

Currently cf-metrics-refinery can read application logs/metrics from Kafka and write the transformed/enriched results to InfluxDB.

For enrichment, metadata is fetched from the Cloud Foundry API and cached in memory:

  • fetch a fresh copy of all metadata: every 10m
  • metadata expires: 3m after it was last used
  • check for expired metadata: every 1m

cf-metrics-refinery was initially designed to read from Kafka because metadata enrichment can block, so Kafka can act as a buffer. The events generated by the Firehose are stored in Kafka topics. You can specify one or more Kafka topics to consume from. The events in Kafka are expected to be JSON-encoded using the format defined in sonde-go (the kafka-firehose-nozzle is designed for this specific task).

(Diagram: metrics pipeline)

We plan to add support for additional input adapters (Firehose) and output adapters (Kafka/Syslog/...) in the future. Adding an adapter simply requires implementing an input.Reader or an output.Writer.
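
As a rough sketch of that adapter shape (the method signatures below are illustrative guesses, not the actual input.Reader/output.Writer definitions):

```go
package main

import "fmt"

// Event is an enriched (or to-be-enriched) firehose event; fields elided.
type Event struct {
	AppGUID string
}

// Reader is roughly the shape an input adapter would implement.
type Reader interface {
	Read() (*Event, error)
}

// Writer is roughly the shape an output adapter would implement.
type Writer interface {
	Write(batch []*Event) error
}

// stubReader is a trivial in-memory input used to show the plumbing.
type stubReader struct{ events []*Event }

func (s *stubReader) Read() (*Event, error) {
	if len(s.events) == 0 {
		return nil, fmt.Errorf("no more events")
	}
	e := s.events[0]
	s.events = s.events[1:]
	return e, nil
}

func main() {
	var r Reader = &stubReader{events: []*Event{{AppGUID: "guid-1"}}}
	e, _ := r.Read()
	fmt.Println(e.AppGUID)
}
```

A new input or output then only needs to satisfy the corresponding interface, and the pipeline in between stays unchanged.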

Currently ContainerMetric, LogMessage and HttpStartStop events are supported, producing the following InfluxDB points:

  • each HttpStartStop event is stored as-is (tagged by instance index, HTTP status code and method)
  • each LogMessage originating from the application (i.e. excluding those from the API, cell and router) is transformed to include only the length in bytes of the payload, then stored (tagged by instance index and by FD, STDOUT or STDERR)
  • each ContainerMetric event is stored as-is (tagged by instance index)

In addition to the tags above, each event also includes tags for org (name/guid), space (name/guid) and app (name/guid).
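
Building that common tag set could look like the following sketch (the tag names and the `Metadata`/`commonTags` identifiers are illustrative, not the actual ones used):

```go
package main

import "fmt"

// Metadata is the cached Cloud Foundry app metadata used for enrichment.
type Metadata struct {
	OrgName, OrgGUID     string
	SpaceName, SpaceGUID string
	AppName, AppGUID     string
}

// commonTags builds the org/space/app tags attached to every point, plus the
// instance index shared by all three event types.
func commonTags(m Metadata, instanceIndex string) map[string]string {
	return map[string]string{
		"org":            m.OrgName,
		"org_guid":       m.OrgGUID,
		"space":          m.SpaceName,
		"space_guid":     m.SpaceGUID,
		"app":            m.AppName,
		"app_guid":       m.AppGUID,
		"instance_index": instanceIndex,
	}
}

func main() {
	tags := commonTags(Metadata{OrgName: "org-1", AppName: "app-1"}, "0")
	fmt.Println(tags["app"], tags["instance_index"])
}
```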

When flushing points to InfluxDB:

  • flush pending events: time-based (every 3s) and size-based (every 5000 points). Note: currently, if messages stop coming, the time-based flush won't happen.
  • retries when a flush fails: 3
  • timeout for checking that InfluxDB is up: 30s

Usage

Basic usage:

$ cf-metrics-refinery [options]

The following options are available:

-h                 Display the manual for cf-metrics-refinery, including all
                   configuration environment variables.
-log-level LEVEL   Log level (DEBUG|INFO|ERROR). Default: INFO

Configuration

cf-metrics-refinery is configured using environment variables. To get the list of environment variables, run cf-metrics-refinery -h.

Install

cf-metrics-refinery can be deployed as Cloud Foundry application using the go-buildpack.

For example:

# push the app
cf push cf-metrics-refinery --no-start

# configure cf-metrics-refinery (alternatively, use a manifest)
cf set-env cf-metrics-refinery "CFMR_..." "..." 
...

# start
cf start cf-metrics-refinery

Contributing

Test

Unit tests

go test ./...

Crash tests

InfluxDB

  • InfluxDB is not available when the app starts. Behavior:
    • check that InfluxDB is up (timeout: 30s)
    • log errors
    • sleep (30s) to avoid flapping instances
    • exit
    • no data loss
  • The InfluxDB job is not available. Behavior:
    • retry (FlushRetries: 3 by default)
    • log errors
    • exit
    • no data loss
  • The InfluxDB VM is not available. Behavior:
    • time out (set via the CFMR_INFLUXDB_TIMEOUT environment variable)
    • retry (FlushRetries: 3 by default)
    • log errors
    • exit
    • no data loss

ZK

  • ZK is not available when the app starts. Behavior:
    • retry
    • log errors
    • exit
    • no data loss
  • The ZK job is not available. Behavior:
    • retry
    • log errors
    • exit
    • no data loss
  • The ZK VM is not available. Behavior:

Kafka

  • The Kafka job is not available. Behavior:
    • retry
    • log errors
    • exit
    • no data loss
  • The Kafka VM is not available. Behavior:

Cloud Controller API

  • The CC API is not available when the app starts. Behavior:
    • log errors
    • exit
    • no data loss
  • CC, UAA or CCDB is not available while the app is running. Behavior:
    • time out (set via the CFMR_CF_TIMEOUT environment variable)
    • retry (default: 2)
    • fail to refresh metadata
    • log errors
    • enrich using the cache
    • no data loss, but the data may not be 100% correct
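
The "enrich using the cache" fallback, together with the negative-lookup cache described in the Design section, can be sketched as follows (the `enricher` type and method names are illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("app not found")

// enricher resolves app GUIDs to names: the in-memory cache is consulted
// first, then the CC API; GUIDs the API cannot resolve go into a
// negative-lookup cache so they are not asked for again.
type enricher struct {
	cache    map[string]string
	negative map[string]bool
	lookup   func(guid string) (string, error) // the CC API call
}

func (e *enricher) appName(guid string) (string, error) {
	if name, ok := e.cache[guid]; ok {
		return name, nil
	}
	if e.negative[guid] {
		return "", errNotFound
	}
	name, err := e.lookup(guid)
	if errors.Is(err, errNotFound) {
		e.negative[guid] = true // remember the miss, skip future CC API calls
		return "", err
	}
	if err != nil {
		return "", err // CC API down: callers keep using stale cache entries
	}
	e.cache[guid] = name
	return name, nil
}

func main() {
	calls := 0
	e := &enricher{
		cache:    map[string]string{},
		negative: map[string]bool{},
		lookup:   func(string) (string, error) { calls++; return "", errNotFound },
	}
	e.appName("gone")
	e.appName("gone") // served from the negative cache, no second API call
	fmt.Println(calls)
}
```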

Design

The idea is:

  • have an input emit events containing sonde-go events.Envelope wrapped with additional fields
    • there would be two types of input: acknowledged (Kafka) and non-acknowledged (Firehose)
  • pass each event to the Enricher
    • currently the metadata from CF is cached in memory. This means no external store is needed, but it also makes it hard to ensure the cached metadata is consistent across parallel instances of cf-metrics-refinery.
    • moreover, to avoid pointless CC API calls, a negative-lookup cache layer stores in memory the app GUIDs that could not be found via the CC API.
  • pass each enriched event to the output
    • there would be two types of output: acknowledged (Kafka, InfluxDB) and non-acknowledged (none right now)
  • ack handling is tricky:
    • if both the input and the output are of the ack type, when the output acks one or more messages, this info is passed back to the input
    • if the input is ack and the output is non-ack, each message is acked to the input immediately
    • if the input is non-ack, nothing is done
    • the additional fields in the event are used for ack purposes (correlating output messages to input messages)
  • expose a stats HTTP endpoint for debugging or even monitoring
  • this component does not do aggregation: that is delegated to the drains targeted by the outputs
  • sarama does not natively support ZK-based consumer groups and offset tracking
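
The three ack-handling cases above reduce to a small decision table; this sketch just encodes them (the `ackPolicy` function is made up for illustration):

```go
package main

import "fmt"

// ackPolicy decides what to do after the output handles a message, following
// the rules above: ack back to the input only when the input tracks offsets.
func ackPolicy(inputAcked, outputAcked bool) string {
	switch {
	case inputAcked && outputAcked:
		return "ack input when output confirms the write"
	case inputAcked && !outputAcked:
		return "ack input immediately"
	default:
		return "nothing to ack"
	}
}

func main() {
	fmt.Println(ackPolicy(true, true))  // e.g. Kafka in, InfluxDB out
	fmt.Println(ackPolicy(true, false)) // Kafka in, non-acked output
	fmt.Println(ackPolicy(false, true)) // Firehose in: input has no acks
}
```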

Copyright

2018 Rakuten, Inc.

License

MIT