Skip to content
This repository has been archived by the owner on Apr 24, 2023. It is now read-only.

Add out_kafka daemonset #11

Closed
wants to merge 9 commits into from
Closed

Add out_kafka daemonset #11

wants to merge 9 commits into from

Conversation

solsson
Copy link
Contributor

@solsson solsson commented Nov 11, 2017

Based on fluent/fluent-bit#94, currently the dev build from fluent/fluent-bit#94 (comment)

A typical record (pretty-printed using jq):

  {
    "kubernetes": {
      "docker_id": "dd50658b5d1110070dfa24511b7c2e6095bc210b7807617bc9c7e9705c64be28",
      "container_name": "fluent-bit",
      "host": "minikube",
      "annotations": {
        "kubernetes.io/created-by": "{\\\"kind\\\":\\\"SerializedReference\\\",\\\"apiVersion\\\":\\\"v1\\\",\\\"reference\\\":{\\\"kind\\\":\\\"DaemonSet\\\",\\\"namespace\\\":\\\"logging\\\",\\\"name\\\":\\\"fluent-bit\\\",\\\"uid\\\":\\\"7b5e7b6d-c6ef-11e7-ada5-080027039766\\\",\\\"apiVersion\\\":\\\"extensions\\\",\\\"resourceVersion\\\":\\\"1695\\\"}}\\n"
      },
      "labels": {
        "version": "v1",
        "pod-template-generation": "1",
        "kubernetes.io/cluster-service": "true",
        "k8s-app": "fluent-bit-logging",
        "controller-revision-hash": "2050036262"
      },
      "pod_id": "af021eb0-c6ef-11e7-ada5-080027039766",
      "namespace_name": "logging",
      "pod_name": "fluent-bit-qpz69"
    },
    "time": "2017-11-11T15:00:40.427239619Z",
    "stream": "stderr",
    "log": "[2017/11/11 15:00:40] [error] [out_kafka] fluent-bit#producer-1: [thrd:kafka-1.broker.kafka.svc.cluster.local:9092/bootstrap]: kafka-1.broker.kafka.svc.cluster.local:9092/1: Receive failed: Disconnected\\n",
    "@timestamp": 1510412440.42724
  }

Notably, compared to filebeat 6.0.0-rc (Yolean/kubernetes-kafka#88), it contains node name and annotations. Also notably, like filebeat, no key is set.

solsson added a commit to Yolean/kubernetes-kafka that referenced this pull request Nov 11, 2017
but beware of the aggregation recursion with the test,
you'll see escaped escaped ... escaped json.
@edsiper
Copy link
Member

edsiper commented Nov 11, 2017 via email

@solsson
Copy link
Contributor Author

solsson commented Nov 11, 2017

Note: there is a Message_Key optional config parameter available

How can I use record fields with that option, for example namespace_name + _ + pod_name?

@edsiper
Copy link
Member

edsiper commented Nov 11, 2017

@solsson that feature is not yet available, meaning the "ability to compose a value based on record keys"

Match *
Brokers bootstrap.kafka:9092
Topics ops.kube-logs-fluentbit.stream.json.001
Timestamp_Key @timestamp
Copy link

@StevenACoffman StevenACoffman Dec 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an option called Retry_Limit set to False, that means if Fluent Bit cannot flush the records to Kafka it will re-try indefinitely until it succeeds.

What are your thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unaware of this but now I found fluent/fluent-bit#309 (comment). I guess the daemonset pods need some monitoring, like a gauge for the number of messages in buffer.

If one message fails it's quite unlikely that the next will be sent, so I think Retry_Limit=False makes sense to preserve ordering, in combination with a very narrow memory limit. It's a decent alternative to buffer monitoring, because if the pod hits the memory limit it will be replaced and generic monitoring on readiness (for example using prometheus-operator and this example) will alert about the issue.

@StevenACoffman
Copy link

StevenACoffman commented Dec 15, 2017

Hrm... @solsson with your recent changes to use the bootstrap service (and @edsiper adding the Retry_Limit False ) I'm getting occasional messages:

[2017/12/15 17:15:40] [ warn] [filter_kube] could not pack merged json
[2017/12/15 17:17:46] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected
[2017/12/15 17:17:46] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected

I'm going to look into it a bit more with debugging on.

@solsson
Copy link
Contributor Author

solsson commented Dec 15, 2017

@StevenACoffman That's quite odd. Unless I'm mistaken Kafka producers will only use the boostrap servers string for the initial connect to brokers, then immediately get proper advertised addresses. I'm prepared to revert that change, but I would like to find the cause first.

@StevenACoffman
Copy link

@edsiper I'm not sure the Retry_Limit logic is working.

Just for testing, I locally reverted bootstrap service change in this pull request, but left my addition of the Retry_Limit false in (and log level set to debug, changed topic to k8s-firehose), and then I deleted the k8s-firehose topic. I then get a bunch of messages like:

logging/fluent-bit-8p6r6[fluent-bit]: % Failed to produce to topic k8s-firehose: Local: Unknown topic

After I recreate the topic in kafka (I have autorecreate disabled), the pods continue to complain indefinitely until I manually restart them. Then they behave again.

By the way, I'm using kail to follow this, and it's pretty great.
kail --ds=fluent-bit will watch the parallel streams in my cluster.

@solsson
Copy link
Contributor Author

solsson commented Dec 22, 2017

I've been running in a QA cluster for a couple of days with Retry_Limit false, without memory limits. Haven't had time to analyze results, but memory usage is suspiciously high on two pods now (45 and 60 MB while the rest are below 20 MB). I also have lots of logs like:

[2017/12/22 18:15:18] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected
[2017/12/22 18:15:18] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected
[2017/12/22 18:15:18] [error] [out_kafka] fluent-bit#producer-1: [thrd:kafka-1.broker.kafka.svc.cluster.local:9092/1]: kafka-1.broker.kafka.svc.cluster.local:9092/1: Receive failed: Disconnected
[2017/12/22 18:15:18] [error] [out_kafka] fluent-bit#producer-1: [thrd:kafka-1.broker.kafka.svc.cluster.local:9092/1]: kafka-1.broker.kafka.svc.cluster.local:9092/1: Receive failed: Disconnected
[2017/12/22 18:25:18] [error] [out_kafka] fluent-bit#producer-1: [thrd:kafka-0.broker.kafka.svc.cluster.local:9092/0]: kafka-0.broker.kafka.svc.cluster.local:9092/0: Receive failed: Disconnected
[2017/12/22 18:25:18] [error] [out_kafka] fluent-bit#producer-1: [thrd:kafka-0.broker.kafka.svc.cluster.local:9092/0]: kafka-0.broker.kafka.svc.cluster.local:9092/0: Receive failed: Disconnected
[2017/12/22 18:35:18] [error] [out_kafka] fluent-bit#producer-1: [thrd:kafka-1.broker.kafka.svc.cluster.local:9092/1]: kafka-1.broker.kafka.svc.cluster.local:9092/1: Receive failed: Disconnected
[2017/12/22 18:35:18] [error] [out_kafka] fluent-bit#producer-1: [thrd:kafka-1.broker.kafka.svc.cluster.local:9092/1]: kafka-1.broker.kafka.svc.cluster.local:9092/1: Receive failed: Disconnected
[2017/12/22 18:35:19] [error] [out_kafka] fluent-bit#producer-1: [thrd:kafka-0.broker.kafka.svc.cluster.local:9092/0]: kafka-0.broker.kafka.svc.cluster.local:9092/0: Receive failed: Disconnected
[2017/12/22 18:35:19] [error] [out_kafka] fluent-bit#producer-1: [thrd:kafka-0.broker.kafka.svc.cluster.local:9092/0]: kafka-0.broker.kafka.svc.cluster.local:9092/0: Receive failed: Disconnected
[2017/12/22 18:45:19] [error] [out_kafka] fluent-bit#producer-1: [thrd:kafka-1.broker.kafka.svc.cluster.local:9092/1]: kafka-1.broker.kafka.svc.cluster.local:9092/1: Receive failed: Disconnected
[2017/12/22 18:45:19] [error] [out_kafka] fluent-bit#producer-1: [thrd:kafka-1.broker.kafka.svc.cluster.local:9092/1]: kafka-1.broker.kafka.svc.cluster.local:9092/1: Receive failed: Disconnected
[2017/12/22 18:55:18] [error] [out_kafka] fluent-bit#producer-1: [thrd:kafka-0.broker.kafka.svc.cluster.local:9092/0]: kafka-0.broker.kafka.svc.cluster.local:9092/0: Receive failed: Disconnected
[2017/12/22 18:55:18] [error] [out_kafka] fluent-bit#producer-1: [thrd:kafka-0.broker.kafka.svc.cluster.local:9092/0]: kafka-0.broker.kafka.svc.cluster.local:9092/0: Receive failed: Disconnected
[2017/12/22 18:55:19] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected
[2017/12/22 18:55:19] [error] [out_kafka] fluent-bit#producer-1: [thrd:bootstrap.kafka:9092/bootstrap]: bootstrap.kafka:9092/bootstrap: Receive failed: Disconnected

I haven't seen any indication that our other Kafka clients have this much issues.

@edsiper
Copy link
Member

edsiper commented Dec 22, 2017

hmm those messages are generated by librdkafka, now using 0.11.1, I will upgrade to 0.11.3 (https://github.com/edenhill/librdkafka/releases/tag/v0.11.3)

@solsson
Copy link
Contributor Author

solsson commented Dec 22, 2017

@edsiper Is there any way for me to pass debug flags to librdkafka? With kafkacat it's very useful to add -d broker,topic if problems like these occur.

@edsiper
Copy link
Member

edsiper commented Dec 22, 2017

rdkafka updated on image fluent/fluent-bit-kafka-dev:0.5

@StevenACoffman
Copy link

StevenACoffman commented Dec 22, 2017

@edsiper @solsson I updated the Yolean/kubernetes-kafka:single-node branch and dumped the files from here in this branch of this repo from 7945b66 together into my own StevenACoffman/kubernetes-kafka#2

This lets me run in minikube a single node kafka cluster, with a fluent-bit daemonset for testing purposes. Hopefully helpful to one of you as well!

Unfortunately, I'm out for the rest of the year, so I'll not be able to give this more attention until Jan 2, 2018.

@solsson
Copy link
Contributor Author

solsson commented Dec 23, 2017

It's worth noting about the QA cluster I've been running on that it's multi-zone (Yolean/kubernetes-kafka#41). With log aggregation being a per-node affair, I wonder if kafka clients too can be "rack-aware". If I set acks to 1 it should be possible to produce to the broker in the same availability zone, if it's up and has the required partitions, and let kafka do the replication out. Now we will sometimes write to a broker that is close, sometimes to one that is farther away which will then replicate back to the closer broker.

solsson added a commit to Yolean/kubernetes-kafka that referenced this pull request Jan 8, 2018
but beware of the aggregation recursion with the test,
you'll see escaped escaped ... escaped json.
solsson added a commit to Yolean/kubernetes-kafka that referenced this pull request Jan 8, 2018
but beware of the aggregation recursion with the test,
you'll see escaped escaped ... escaped json.
@solsson solsson mentioned this pull request Jan 16, 2018
@solsson
Copy link
Contributor Author

solsson commented Jan 16, 2018

#16 replaces this

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants