Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP integrate linkedin/cruise-control, fixes #100 #218

Merged
merged 19 commits into from
Jan 6, 2019
Merged

WIP integrate linkedin/cruise-control, fixes #100 #218

merged 19 commits into from
Jan 6, 2019

Conversation

double16
Copy link
Contributor

@double16 double16 commented Nov 1, 2018

linkedin/cruise-control automates much of the operations of Kafka.

I present this PR for discussion.

@solsson
Copy link
Contributor

solsson commented Dec 19, 2018

What remaining work do you see? It's a clean opt-in that's confined to the cruise-control. Are there risks with installing cruise-control?

@double16
Copy link
Contributor Author

cruise-control is conservative with making changes to the cluster, so it looks safe. I can't think of remaining work, but I can't say I'm an expert so I marked as WIP. If no one has objections then merge?

@solsson
Copy link
Contributor

solsson commented Dec 19, 2018

How about we merge to master instead of 4.x, or are we lacking compatibility with Kafka 2.x?

@double16
Copy link
Contributor Author

master sounds good to me. Kafka 2.x is supported with this version of cruise-control.

@solsson solsson changed the base branch from 4.x to master December 19, 2018 20:36
@solsson
Copy link
Contributor

solsson commented Dec 19, 2018

I get a crashloop after merge to master:

288    [           main] WARN  lients.consumer.ConsumerConfig  - The configuration 'sample.store.class' was supplied but isn't a known config.
288    [           main] WARN  lients.consumer.ConsumerConfig  - The configuration 'default.goals' was supplied but isn't a known config.
289    [           main] INFO  fka.common.utils.AppInfoParser  - Kafka version : 2.0.0
290    [           main] INFO  fka.common.utils.AppInfoParser  - Kafka commitId : 3402a8361b734732
415    [           main] INFO  .apache.kafka.clients.Metadata  - Cluster ID: n3QC-MxtQZeaRQG7avwTxQ
Exception in thread "main" java.lang.IllegalStateException: Cruise Control cannot find sampling topic matches __CruiseControlMetrics in the target cluster.
	at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsReporterSampler.configure(CruiseControlMetricsReporterSampler.java:207)
	at com.linkedin.kafka.cruisecontrol.config.KafkaCruiseControlConfig.getConfiguredInstance(KafkaCruiseControlConfig.java:1076)
	at com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager.<init>(MetricFetcherManager.java:129)
	at com.linkedin.kafka.cruisecontrol.monitor.sampling.MetricFetcherManager.<init>(MetricFetcherManager.java:69)
	at com.linkedin.kafka.cruisecontrol.monitor.task.LoadMonitorTaskRunner.<init>(LoadMonitorTaskRunner.java:74)
	at com.linkedin.kafka.cruisecontrol.monitor.LoadMonitor.<init>(LoadMonitor.java:157)
	at com.linkedin.kafka.cruisecontrol.monitor.LoadMonitor.<init>(LoadMonitor.java:113)
	at com.linkedin.kafka.cruisecontrol.KafkaCruiseControl.<init>(KafkaCruiseControl.java:106)
	at com.linkedin.kafka.cruisecontrol.async.AsyncKafkaCruiseControl.<init>(AsyncKafkaCruiseControl.java:67)
	at com.linkedin.kafka.cruisecontrol.KafkaCruiseControlMain.main(KafkaCruiseControlMain.java:71)

Could it be because of #221?

@double16
Copy link
Contributor Author

Yes, it's because of #221. If cruise-control supports creating the topic it isn't obvious. May need to do that in the init container.

solsson added a commit to solsson/dockerfiles that referenced this pull request Dec 19, 2018
in Yolean/kubernetes-kafka#218

Because downloading from an external server at broker pod start is a risk.
@solsson
Copy link
Contributor

solsson commented Dec 20, 2018

Added topic creation. I think we should take a look at the WARNings in logs. A lot about unused config.

What's a good test to see if Cruise Control is operational?

@double16
Copy link
Contributor Author

Thanks! I was looking into that but hadn't gotten there yet. I accidentally committed version 2.0.17 of cruise control. I hadn't tested that thoroughly yet, it seems it's more than a bug fix release. Can you change that to 2.0.6 ? You can either use a proxy or change the service to a load balancer and hit http://localhost:8090/kafkacruisecontrol/state .

@double16
Copy link
Contributor Author

Regarding the warnings, yes, we should look into those. I noticed those when working on 2.0.17. I always noticed the web server config isn't be recognized. IIRC, 2.0.6 has a much cleaner log.

@solsson
Copy link
Contributor

solsson commented Dec 20, 2018

I'd like to start with the latest version. Support for 2.x seems to be a WIP so we'll need to follow new releases I think. The warnings don't block merge, we can look at them once up and running. I'll try 2.0.17.

For the usage there should be a README in the cruise-control folder. We can start small. What I thought was that it would be great to prove if some essential automation use case is working. Under replicated partitions etc.

@solsson
Copy link
Contributor

solsson commented Dec 20, 2018

Are you ok with the last three commits in https://github.com/Yolean/kubernetes-kafka/compare/cruise-control-reconfig?expand=1 ?

I found a really bad effect of keeping patch files this way, same as with ./metrics/ probably. I wanted to uninstall cruise-conrol so I did kubectl delete -f cruise-control/ and got this statefulset.apps "kafka" deleted 😬 . I'm still in dev of course.

Update: pushed f21d15b to address that. Now it's four new commits.

@double16
Copy link
Contributor Author

Yes, those commits are better. Especially preventing delete! Would you like me to work on the README?

@solsson
Copy link
Contributor

solsson commented Dec 21, 2018

I've re-enabled self healing (da75c0a) and I'm running cruise control now in a test cluster. I think what remains is:

  • The readme, as you suggested
  • Can we get rid of patching on a CLASSPATH env? Not priority because the extensions concept is quite elegant and allows other addons as well. If this is how CLASSPATH is meant to be used with kafka, we should move the extensions mechanism to the core manifest. It's an emptyDir so I see no risk of unintended concequences.
  • Re-build the cruise-control image with the latest VERSION + based on https://github.com/solsson/dockerfiles/blob/master/kafka/Dockerfile#L2
  • Resource limits. If we go with the JDK 11 base image.

@solsson
Copy link
Contributor

solsson commented Dec 21, 2018

Actually it'll be a bit of a pain to keep the two image in sync, the one with the metrics jar and the actual cruise control image. Maybe https://github.com/Yolean/kubernetes-kafka/pull/218/files#diff-36facb0f724a84076b8a2525be1c081aR11 should be the actual cruise control image. Should I add the cruise control build to https://github.com/solsson/dockerfiles?

@double16
Copy link
Contributor Author

I think moving the cruise-control image to https://github.com/solsson/dockerfiles is a good idea for maintenance reasons. (FYI, cruise-control 2.0.18 has been released). There isn't a good reason to keep a separate cruise-control and cruise-control-reporter image, I say pull the report jar from the cruise-control image.

As to the CLASSPATH var... yeah, that's how the Kafka script is written. Providing the extensions dir up a level would be great for other integrations.

I've started a README, let me know which of these things you'd like me to work on so we're not duplicating effort.

@solsson
Copy link
Contributor

solsson commented Dec 21, 2018

I interpret https://github.com/apache/kafka/blob/2.1/bin/kafka-run-class.sh#L129 as wildcard being supported for CLASSPATH, so I'll go for the generalization of the extensions mechanism.

@solsson
Copy link
Contributor

solsson commented Dec 21, 2018

Currently blocked by solsson/dockerfiles#20 (comment). Better luck tomorrow maybe.

@solsson
Copy link
Contributor

solsson commented Dec 27, 2018

After 432f1e8 and 7cf81da all that remains is testing. The new image build is completely untested.

@double16
Copy link
Contributor Author

I'm working on testing replacement of broker nodes.

@double16
Copy link
Contributor Author

I've tested starting with 3 nodes, then scaling up and down. Cruise control re-assigns partitions to balance as expected. When scaling down to 2 it doesn't do anything, but that may be a config item where the min number of nodes is 3.

@solsson
Copy link
Contributor

solsson commented Jan 2, 2019

I haven't done any real testing, but I'm fairly confident that the addition of the extensions mechanism to core Kafka is low risk. The rest is opt-in. Before merge I'd like us to consider adding a bit more to the readme:

@double16
Copy link
Contributor Author

double16 commented Jan 5, 2019

I added to the README, what do you think?

@solsson solsson merged commit c805f9b into Yolean:master Jan 6, 2019
@solsson solsson mentioned this pull request Jan 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants