increase TLS passthrough buffer size from 4k to 16K #9540

Closed
scholzj opened this issue Jan 24, 2023 · 10 comments · Fixed by #9548
Labels
kind/feature · needs-priority · needs-triage

Comments

@scholzj

scholzj commented Jan 24, 2023

What happened:

In Strimzi, we support the Ingress Nginx controller as one of the ways to access Apache Kafka from outside of Kubernetes. Since Apache Kafka has its own binary protocol that is not based on HTTP, we have to use TLS passthrough for it. For a long time this worked fine, and it is relatively popular among our users (BTW: thanks for all the effort you put into this project).

But recently, we moved from Java 11 to Java 17 as the runtime for our Apache Kafka clusters. Java 17 seems to change the TLS behavior slightly. In particular, applications using Java 17 seem to try to resume TLS sessions when possible, and they use a relatively big session ticket for it which was not used with Java 11. You can see an example of such a TLS ClientHello here.

It looks like the Nginx Ingress controller handles the TLS passthrough itself in the tcp.go file, where it seems to allocate only 4096 bytes to read the ClientHello, decode it, extract the TLS SNI hostname from it, and forward the connection. But when the ClientHello is too big and does not fit into the 4096 bytes, the controller fails to decode the SNI hostname and simply forwards the connection to the default backend instead of treating it as TLS passthrough and forwarding it to the appropriate service.
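To make that failure mode concrete, here is a minimal Go sketch of the pattern, not the actual tcp.go code: a single bounded Read, then the bytes are handed to crypto/tls purely to parse the ClientHello. `peekSNI`, `readOnlyConn`, and `bufSize` are illustrative names, and the `GetConfigForClient` hook stands in for the controller's own SNI parser.

```go
package main

import (
	"bytes"
	"crypto/tls"
	"errors"
	"fmt"
	"net"
	"time"
)

// readOnlyConn replays already-buffered bytes to crypto/tls and swallows
// whatever the handshake tries to write back, so a ClientHello can be
// parsed without completing a handshake.
type readOnlyConn struct{ r *bytes.Reader }

func (c readOnlyConn) Read(p []byte) (int, error)       { return c.r.Read(p) }
func (c readOnlyConn) Write(p []byte) (int, error)      { return len(p), nil }
func (c readOnlyConn) Close() error                     { return nil }
func (c readOnlyConn) LocalAddr() net.Addr              { return nil }
func (c readOnlyConn) RemoteAddr() net.Addr             { return nil }
func (c readOnlyConn) SetDeadline(time.Time) error      { return nil }
func (c readOnlyConn) SetReadDeadline(time.Time) error  { return nil }
func (c readOnlyConn) SetWriteDeadline(time.Time) error { return nil }

// peekSNI does a single Read of at most bufSize bytes and tries to parse
// the SNI hostname out of the ClientHello. If the hello is larger than
// bufSize, the truncated bytes fail to parse and host comes back empty --
// the failure mode described in this issue.
func peekSNI(conn net.Conn, bufSize int) (host string, buffered []byte, err error) {
	buf := make([]byte, bufSize)
	n, err := conn.Read(buf)
	if err != nil {
		return "", nil, err
	}
	buffered = buf[:n]

	// Let crypto/tls do the parsing: GetConfigForClient is invoked with
	// the decoded ClientHello, after which the handshake is aborted.
	_ = tls.Server(readOnlyConn{bytes.NewReader(buffered)}, &tls.Config{
		GetConfigForClient: func(hi *tls.ClientHelloInfo) (*tls.Config, error) {
			host = hi.ServerName
			return nil, errors.New("stop after ClientHello")
		},
	}).Handshake()

	if host == "" {
		return "", buffered, fmt.Errorf("no SNI in first %d bytes", n)
	}
	return host, buffered, nil
}

func main() {
	client, server := net.Pipe()
	go func() {
		// Drive a real ClientHello onto the pipe; the client handshake
		// then stalls waiting for a ServerHello, which is fine here.
		tls.Client(client, &tls.Config{ServerName: "broker-0.p50"}).Handshake()
	}()
	host, _, err := peekSNI(server, 4096)
	fmt.Println(host, err) // broker-0.p50 <nil>
}
```

With bufSize set to 4096, a resumed Java 17 connection's oversized ClientHello is truncated and host comes back empty, matching the fallback-to-default-backend behavior above; 16384 leaves room for the largest possible TLS record payload. Whatever bytes were read must of course still be replayed to whichever backend the connection is forwarded to.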

In the controller logs, I can see something like this for the initial connection, which is successful because the first ClientHello is small:

I0123 22:24:58.729136       7 nginx.go:790] "Handling TCP connection" remote="192.168.1.86:4373" local="172.16.41.189:443"
I0123 22:24:58.744848       7 tcp.go:73] "TLS Client Hello" host="broker-0.p50"

But for the follow-up connections which try to reuse the session and send the large ClientHello with the session ticket, I get only something like this when it forwards the connection to the default backend:

I0123 22:25:01.894653       7 nginx.go:790] "Handling TCP connection" remote="192.168.1.86:41952" local="172.16.41.189:443"
I0123 22:25:01.907901       7 tcp.go:101] "Writing Proxy Protocol" header=<
	PROXY TCP4 192.168.1.86 172.16.41.189 41952 443
 >

To double-check the idea that the 4096 bytes are the issue, I made a custom build of the controller that changes the buffer size from 4096 bytes to 16384 bytes. And that indeed seems to address the problem.

TBH, I'm not a TLS expert. So I do not really know if 16384 bytes works for every case or if someone would need it bigger. But it helped my applications (I tried 8192 as well, but that was not enough).

Would it be possible to increase the value of the buffer for the TLS passthrough feature? Or maybe make it configurable, if you are wary of increasing the default?


What you expected to happen:

That the TLS passthrough feature works even with Java 17 / applications that send ClientHello larger than 4096 bytes.


NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): 1.5.1
Kubernetes version (use kubectl version): 1.26.0

Environment:

  • Cloud provider or hardware configuration: Bare-metal
  • OS (e.g. from /etc/os-release): CentOS
  • Install tools: Kubeadm
  • How was the ingress-nginx-controller installed: Single instance per cluster installed from my own YAML files.

How to reproduce this issue:

If needed, I can provide a guide on how to install Strimzi and Apache Kafka to reproduce it. Or I can try to create a simpler reproducer. But I was kinda hoping that the explanation above is sufficient. Please let me know.

@scholzj scholzj added the kind/bug Categorizes issue or PR as related to a bug. label Jan 24, 2023
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 24, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@longwuyuan
Contributor

@scholzj, thank you very much for the details of the issue. It helps a lot.

  • The project is in a stabilization phase
  • There are several reasons for attempting to stabilize the project
  • One of the reasons is a lack of resources like developer time
  • Another reason is balancing the complexity between factors like changes in K8S KEPs, changes in upstream Nginx itself, in addition to security, performance, etc.
  • Problems like nobody being available to work on supporting & maintaining fringe-use/corner-case features & functions
  • So there is no activity on new features, and most changes are critical bug fixes or stabilization work

Secondly, there is a plan to re-implement TLS passthrough and related functionality. It is to be attempted sooner rather than later. I am not certain if this will impact your use-case, but I think it likely will.

I would recommend that much more information be exchanged. Please feel free to attend the community meetings; details are here: https://github.com/kubernetes/community/tree/master/sig-network

16K and passthrough means someone needs to talk authoritatively about the potential for undesired payload in that 16K, and other things specific to TLS and the current passthrough implementation. Nginx seems to have a native streaming ability.

/remove-kind bug
/kind feature

/retitle increase TLS passthrough buffer size from 4k to 16K

@k8s-ci-robot k8s-ci-robot changed the title TLS Passthrough mode is not compatible with applications running on Java 17 increase TLS passthrough buffer size from 4k to 16K Jan 24, 2023
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jan 24, 2023
@scholzj
Author

scholzj commented Jan 24, 2023

@longwuyuan Thanks for the response. TBH, I managed to debug it up to identifying that the buffer is the problem here. But I'm not enough of a TLS or Ingress expert to judge the consequences of increasing the buffer. And as I said, I have no idea whether tomorrow someone will come and say it needs to be 32K or something like that.

If you would decide that making it configurable to offload the decision to the users is the right way, I can try to contribute it. But if a rewrite of the TLS passthrough is planned anyway, I understand that such a solution might not be desired.

In the meantime, it seems that setting the Java system property jdk.tls.server.enableSessionTicketExtension to false on the server side will disable the TLS feature which is causing this problem. From my point of view, it is more a workaround than a solution, so I do not think it solves this issue, which I guess will sooner or later come from other servers as well, not only from Java. But I wanted to mention it here in case someone runs into this issue.
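For reference, on a standalone JVM (outside Strimzi) a system property like this would normally be passed on the broker command line as `-Djdk.tls.server.enableSessionTicketExtension=false`; a Strimzi-style jvmOptions YAML equivalent appears in a later comment below.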

@longwuyuan
Contributor

That is ok because you have presented a genuine problem that is being worked on. Please do keep track of project changes and hopefully join a community meeting. Once the actual re-implementation of the TLS passthrough function is underway, you will see a related PR.

@strongjz
Member

I don't know the ramifications of changing this, so I think a good compromise is to make it configurable by the end user and not hard-set at 4k.

@strongjz
Member

Ok, it seems to make sense to make it 16k by default, but I still like the idea of making it configurable.

  • Byte 0 = SSL record type
  • Bytes 1-2 = SSL version (major/minor)
  • Bytes 3-4 = Length of data in the record (excluding the header itself)
  • The maximum record length SSL supports is 16384 (16K).
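A small Go illustration of that header layout, with `recordLen` as a hypothetical helper rather than anything from the controller:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// recordLen decodes the 5-byte TLS record header laid out above and
// returns the length of the record payload. A single TLS plaintext
// record is capped at 2^14 = 16384 bytes, which is where the proposed
// buffer size comes from.
func recordLen(hdr [5]byte) (int, error) {
	if hdr[0] != 0x16 { // 0x16 = handshake record, which carries the ClientHello
		return 0, fmt.Errorf("not a handshake record: type 0x%02x", hdr[0])
	}
	n := int(binary.BigEndian.Uint16(hdr[3:5])) // bytes 3-4, big-endian
	if n > 16384 {
		return 0, fmt.Errorf("record length %d exceeds the TLS maximum of 16384", n)
	}
	return n, nil
}

func main() {
	// Handshake record, TLS 1.2 wire version 0x0303, 6000-byte payload.
	fmt.Println(recordLen([5]byte{0x16, 0x03, 0x03, 0x17, 0x70})) // 6000 <nil>
}
```

Note that the 16384 cap applies to the record payload; the 5-byte header sits on top of that.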

@scholzj
Author

scholzj commented Feb 2, 2023

Thanks for fixing this! 😍

@nkostoulas

nkostoulas commented Nov 28, 2023

In the meantime, it seems that setting the Java system property jdk.tls.server.enableSessionTicketExtension to false on the server side will disable the TLS feature which is causing this problem. From my point of view, it is more a workaround than a solution, so I do not think it solves this issue, which I guess will sooner or later come from other servers as well, not only from Java. But I wanted to mention it here in case someone runs into this issue.

@scholzj is that on the Strimzi brokers? We've added this and are still seeing the issue 😞 Is any further change required on the Java clients as well, or is it only the brokers that need this?

Our config looks like this:

    jvmOptions:
      -Xms: 4G
      -Xmx: 4G
      javaSystemProperties:
        - name: jdk.tls.server.enableSessionTicketExtension
          value: "false"

@scholzj
Author

scholzj commented Nov 28, 2023

@nkostoulas To be honest, I do not remember it in detail anymore, as it has been fixed for some time now and I have not needed the workaround since. As far as I remember, it was on the broker side; I do not think I had to do anything else on the client side.

@nkostoulas

I see, thanks @scholzj.
