Benchmark: HTTP2 fix cancel_stream_error #2612

Draft · wants to merge 2 commits into main

Conversation

dsmiley
Contributor

@dsmiley dsmiley commented Aug 1, 2024

Fixes Jetty HTTP2 cancel_stream_error by removing the flow control limit.

I was encountering this issue with CloudIndexing, esp when using single-node, single-replica, single-shard -- like 30K ops/sec. It'd routinely fail. I bumped this to 1000, then still hit it sometimes, then bumped to 2000. I haven't run it much since then to be honest. Then I saw how to disable it, which looks a little weird TBH (the default method returns a NO_RATE_CONTROL impl).

dsmiley added 2 commits August 1, 2024 16:36
Fixes Jetty HTTP2 cancel_stream_error by increasing the flow control limit.
jetty.http2.rateControl.maxEventsPerSecond=2000
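
For context, a minimal sketch of the two approaches discussed, assuming Jetty 10's org.eclipse.jetty.http2.parser.RateControl API; this is illustrative only, not the exact change in this PR:

import org.eclipse.jetty.http2.parser.RateControl;
import org.eclipse.jetty.http2.parser.WindowRateControl;
import org.eclipse.jetty.http2.server.HTTP2CServerConnectionFactory;
import org.eclipse.jetty.server.HttpConfiguration;

public class RateControlSketch {
    public static void main(String[] args) {
        HTTP2CServerConnectionFactory h2c =
                new HTTP2CServerConnectionFactory(new HttpConfiguration());

        // Option A: raise the per-connection event budget, mirroring
        // jetty.http2.rateControl.maxEventsPerSecond=2000 from the commit above.
        h2c.setRateControlFactory(new WindowRateControl.Factory(2000));

        // Option B: disable rate control entirely; NO_RATE_CONTROL never limits events.
        h2c.setRateControlFactory(endPoint -> RateControl.NO_RATE_CONTROL);
    }
}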
Contributor Author

@dsmiley dsmiley left a comment

DRAFT; I still see the error and I'm confused

@iamsanjay
Contributor

Could you please walk me through how to run the specific benchmark that produces the error, such as a command or anything else that would help me reproduce it on my system?

@dsmiley
Contributor Author

dsmiley commented Aug 15, 2024

crave run -- ./jmh.sh CloudIndexing -jvmArgsAppend -Dsolr.bench.seed=823716671780822 -p useSmallDocs=true -p nodeCount=1 -p numShards=1 -p numReplicas=1 -I 8

The benchmark module uses JettySolrRunner via MiniClusterState.

Googling for this is inconclusive as to the cause. It seems the server is choosing to cancel the client's stream, but we don't know why. I dug into this before and thought it was a rate-limiting mechanism, so this PR disables that. Perhaps the next debugging step is to figure out what Jetty config/logging would explain it. We'd want to avoid debug level on everything -- too much output.

@iamsanjay
Contributor

I recall encountering a somewhat similar error in the IndexFetcher class when we transitioned to HTTP2. The IndexFetcher class was prematurely closing the client connection and sending a GOAWAY packet to the server. When the flow control limit was reached, it triggered an exception.

I suspect the same issue might be occurring in this case. Unfortunately, I can't run the crave script anymore since upgrading to the new Apple Silicon chip; I'll have to look into that.

@dsmiley dsmiley marked this pull request as draft August 22, 2024 19:16
@iamsanjay
Contributor

Is there any way to enable more logging for the benchmark? I wonder if it's too much overhead with the current Jetty configuration.

org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:57165/solr/testCollection/update?wt=javabin&version=2: Async exception during distributed update: Error from server at http://127.0.0.1:57164/solr: null



request: http://127.0.0.1:57164/solr
Remote error message: Server error writing document id 5234 to the index.
        at org.apache.solr.client.solrj.impl.HttpSolrClientBase.processErrorsAndResponse(HttpSolrClientBase.java:280)
        at org.apache.solr.client.solrj.impl.Http2SolrClient.processErrorsAndResponse(Http2SolrClient.java:549)
        at org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:505)
        at org.apache.solr.bench.index.CloudIndexing.indexDoc(CloudIndexing.java:138)
        at org.apache.solr.bench.index.jmh_generated.CloudIndexing_indexDoc_jmhTest.indexDoc_thrpt_jmhStub(CloudIndexing_indexDoc_jmhTest.java:242)
        at org.apache.solr.bench.index.jmh_generated.CloudIndexing_indexDoc_jmhTest.indexDoc_Throughput(CloudIndexing_indexDoc_jmhTest.java:111)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:527)

@dsmiley
Contributor Author

dsmiley commented Sep 19, 2024

Maybe you're unfamiliar with Log4j, but look at log4j2-bench.xml in this module.
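
For example, a hedged fragment (not the file's actual contents) showing the kind of per-package override that could go inside the existing Loggers section of log4j2-bench.xml, so Jetty's HTTP/2 layer logs at DEBUG without turning everything up:

<Logger name="org.eclipse.jetty.http2" level="DEBUG"/>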

@iamsanjay
Contributor

MaxConnectionsPerDestination 8

Benchmark               (nodeCount)  (numReplicas)  (numShards)  (preGenerate)  (useSmallDocs)   Mode  Cnt      Score      Error  Units
CloudIndexing.indexDoc            4              1            4          50000            true  thrpt    4  16140.035 ± 1902.453  ops/s
CloudIndexing.indexDoc            4              1            4          50000           false  thrpt    4   1818.103 ±   42.214  ops/s
CloudIndexing.indexDoc            4              3            4          50000            true  thrpt    4   7280.289 ± 1774.366  ops/s
CloudIndexing.indexDoc            4              3            4          50000           false  thrpt    4   1546.771 ± 2012.373  ops/s

MaxConnectionsPerDestination 16

Benchmark               (nodeCount)  (numReplicas)  (numShards)  (preGenerate)  (useSmallDocs)   Mode  Cnt      Score      Error  Units
CloudIndexing.indexDoc            4              1            4          50000            true  thrpt    4  16374.754 ± 1354.421  ops/s
CloudIndexing.indexDoc            4              1            4          50000           false  thrpt    4   2053.636 ±  625.117  ops/s
CloudIndexing.indexDoc            4              3            4          50000            true  thrpt    4   7919.480 ± 4504.359  ops/s
CloudIndexing.indexDoc            4              3            4          50000           false  thrpt    4   1760.070 ± 2226.050  ops/s

MaxConnectionsPerDestination 32

Benchmark               (nodeCount)  (numReplicas)  (numShards)  (preGenerate)  (useSmallDocs)   Mode  Cnt      Score      Error  Units
CloudIndexing.indexDoc            4              1            4          50000            true  thrpt    4  15566.625 ± 1666.612  ops/s
CloudIndexing.indexDoc            4              1            4          50000           false  thrpt    4   2084.775 ±  239.749  ops/s
CloudIndexing.indexDoc            4              3            4          50000            true  thrpt    4   7505.612 ± 4698.342  ops/s
CloudIndexing.indexDoc            4              3            4          50000           false  thrpt    4   1684.325 ± 2826.760  ops/s

MaxConnectionsPerDestination 64

Benchmark               (nodeCount)  (numReplicas)  (numShards)  (preGenerate)  (useSmallDocs)   Mode  Cnt      Score      Error  Units
CloudIndexing.indexDoc            4              1            4          50000            true  thrpt    4  15576.925 ± 1727.535  ops/s
CloudIndexing.indexDoc            4              1            4          50000           false  thrpt    4   2111.832 ±  189.408  ops/s
CloudIndexing.indexDoc            4              3            4          50000            true  thrpt    4   8069.933 ± 4096.516  ops/s
CloudIndexing.indexDoc            4              3            4          50000           false  thrpt    4   1706.464 ± 2694.685  ops/s

@iamsanjay
Contributor

iamsanjay commented Sep 19, 2024

One thing I noticed: once I increased maxConnectionsPerDestination to 8 and kept increasing it up to 64, I didn't see any exceptions and was able to run the full benchmark without any issue.

Below I ran it with 2, and the first benchmark immediately failed!

MaxConnectionsPerDestination 2

Benchmark               (nodeCount)  (numReplicas)  (numShards)  (preGenerate)  (useSmallDocs)   Mode  Cnt     Score      Error  Units
CloudIndexing.indexDoc            4              1            4          50000           false  thrpt    4  1730.811 ±   99.685  ops/s
CloudIndexing.indexDoc            4              3            4          50000            true  thrpt    4  8144.242 ± 7693.263  ops/s

@iamsanjay
Contributor

MaxConnectionsPerDestination 128

Benchmark               (nodeCount)  (numReplicas)  (numShards)  (preGenerate)  (useSmallDocs)   Mode  Cnt      Score      Error  Units
CloudIndexing.indexDoc            4              1            4          50000            true  thrpt    4  15310.374 ± 2095.507  ops/s
CloudIndexing.indexDoc            4              1            4          50000           false  thrpt    4   2066.279 ±   64.590  ops/s
CloudIndexing.indexDoc            4              3            4          50000            true  thrpt    4   7189.295 ± 5208.593  ops/s
CloudIndexing.indexDoc            4              3            4          50000           false  thrpt    4   1774.028 ± 2590.350  ops/s

@dsmiley
Contributor Author

dsmiley commented Sep 19, 2024

Wow; good find! What happens with setMaxConnectionsPerDestination(1)? I wonder what Jetty is supposed to do when this limit is approached -- is cancel_stream_error expected in some sense, or should it be doing something else? I suspect a Jetty issue.

I should mention that fairly soon, in #2410, maxConnectionsPerDestination will be set more appropriately (not 4!), and then this issue may no longer be experienced.
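
For reference, a minimal sketch of where that knob lives on the client side, assuming Jetty 10's HttpClient with the HTTP/2 transport; the value 16 is just one of the settings benchmarked above, not a recommendation:

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.http2.client.HTTP2Client;
import org.eclipse.jetty.http2.client.http.HttpClientTransportOverHTTP2;

public class MaxConnectionsSketch {
    public static void main(String[] args) throws Exception {
        HTTP2Client http2Client = new HTTP2Client();
        HttpClient httpClient = new HttpClient(new HttpClientTransportOverHTTP2(http2Client));

        // Cap the number of connections the client opens to each destination.
        httpClient.setMaxConnectionsPerDestination(16);
        httpClient.start();
        try {
            // ... issue requests ...
        } finally {
            httpClient.stop();
        }
    }
}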

@iamsanjay
Contributor

Actually, I'm still not convinced the issue is related to the maxConnectionsPerDestination setting, because when I ran it again and again today, I hit the issue with higher values as well, such as 8, 16, and 32. Since a few runs can't tell which value is better, we could set up a dedicated benchmark that runs with different values of maxConnectionsPerDestination and maxConcurrentStreams (though the latter is a server-level setting that I'm not yet sure how to configure in the benchmark).

There is a possibility that Jetty is cancelling the stream because one of the shards went into recovery!

Server exception => org.apache.solr.common.SolrException: ClusterState says we are the leader (http://127.0.0.1:54734/solr/testCollection_shard4_replica_n4), but locally we don't think so. Request came from http://127.0.0.1:54736/solr/testCollection_shard2_replica_n16/
        at org.apache.solr.update.processor.DistributedZkUpdateProcessor.doDefensiveChecks(DistributedZkUpdateProcessor.java:1144)
org.apache.solr.common.SolrException: ClusterState says we are the leader (http://127.0.0.1:54734/solr/testCollection_shard4_replica_n4), but locally we don't think so. Request came from http://127.0.0.1:54736/solr/testCollection_shard2_replica_n16/

request: http://127.0.0.1:54734/solr
Remote error message: ClusterState says we are the leader (http://127.0.0.1:54734/solr/testCollection_shard4_replica_n4), but locally we don't think so. Request came from http://127.0.0.1:54736/solr/testCollection_shard2_replica_n16/
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:283) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 549b46da0c7030249c81f827b801bb24bd6b57cd [snapshot build, details omitted]]
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.run(ConcurrentUpdateHttp2SolrClient.java:184) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 549b46da0c7030249c81f827b801bb24bd6b57cd [snapshot build, details omitted]]
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:212) ~[metrics-core-4.2.26.jar:4.2.26]
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$1(ExecutorUtil.java:449) ~[solr-solrj-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT 549b46da0c7030249c81f827b801bb24bd6b57cd [snapshot build, details omitted]]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.base/java.lang.Thread.run(Thread.java:829) [?:?]

2024-09-20 21:58:32.200 ERROR (updateExecutor-38-thread-11-processing-testCollection_shard3_replica_n2 null-30262 core_node5 null 127.0.0.1:54735_solr testCollection shard3) [c:testCollection s:shard3 r:core_node5 x:testCollection_shard3_replica_n2] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient
          Error when calling SolrCmdDistributor$Req: cmd=add{,id=(null)}; node=ForwardNode: http://127.0.0.1:54734/solr/testCollection_shard4_replica_n4/ to http://127.0.0.1:54734/solr/testCollection_shard4_replica_n4/ => org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:54734/solr: null

This one looks different!

 Client exception => org.apache.solr.common.SolrException: can not use FieldCache on a field w/o docValues unless it is indexed uninvertible=true and the type supports Uninversion: _version_
        at org.apache.solr.schema.SchemaField.checkFieldCacheSource(SchemaField.java:295)
org.apache.solr.common.SolrException: can not use FieldCache on a field w/o docValues unless it is indexed uninvertible=true and the type supports Uninversion: _version_

@iamsanjay
Contributor

We could test a setup where we try to index a document while a shard is down (rewriting the ZooKeeper cluster state) and see if we get the same cancel_stream error.

Also, in some cases I encountered an OutOfMemory exception, suggesting we might need to scale it down a bit, as the current setup seems somewhat unstable.

@HoustonPutman
Contributor

This one looks different!

 Client exception => org.apache.solr.common.SolrException: can not use FieldCache on a field w/o docValues unless it is indexed uninvertible=true and the type supports Uninversion: _version_
        at org.apache.solr.schema.SchemaField.checkFieldCacheSource(SchemaField.java:295)
org.apache.solr.common.SolrException: can not use FieldCache on a field w/o docValues unless it is indexed uninvertible=true and the type supports Uninversion: _version_

This error is not related; just make sure your _version_ field has docValues enabled.

@dsmiley
Contributor Author

dsmiley commented Sep 20, 2024

There is a possibility that Jetty is cancelling the stream because one of the shards went into recovery!

I'm surprised, and suspicious, that app-level concerns/errors (Solr returning an HTTP error to the client) would result in a lower-level HTTP/TCP stream cancellation of some kind.

@iamsanjay
Contributor

We are using Jetty 10, right?
jetty/jetty.project#10485

@dsmiley
Contributor Author

dsmiley commented Sep 24, 2024

Yes; our versions are in versions.props

@iamsanjay
Contributor

iamsanjay commented Sep 26, 2024

Latest benchmark with different values for maxConnectionsPerDestination

Benchmark               (maxConnectionPerHost)  (nodeCount)  (numReplicas)  (numShards)  (preGenerate)  (useSmallDocs)   Mode  Cnt     Score      Error  Units
CloudIndexing.indexDoc                       1            4              3            4          50000           false  thrpt    4  1118.424 ±  847.096  ops/s
CloudIndexing.indexDoc                       2            4              3            4          50000           false  thrpt    4  1701.146 ± 1943.553  ops/s
CloudIndexing.indexDoc                       4            4              3            4          50000           false  thrpt    4  1726.712 ± 2448.684  ops/s
CloudIndexing.indexDoc                       8            4              3            4          50000           false  thrpt    4  1650.018 ± 2238.615  ops/s
CloudIndexing.indexDoc                      16            4              3            4          50000           false  thrpt    4  1651.497 ± 2117.253  ops/s
CloudIndexing.indexDoc                      32            4              3            4          50000           false  thrpt    4  1655.557 ± 2347.372  ops/s
CloudIndexing.indexDoc                      64            4              3            4          50000           false  thrpt    4  1695.647 ± 2139.217  ops/s
CloudIndexing.indexDoc                     128            4              3            4          50000           false  thrpt    4  1737.919 ± 2381.445  ops/s

It's a good improvement as we increase maxConnectionsPerDestination.

The pattern I see so far is that if we increase maxConnectionsPerDestination, there are still failures, but all of them are OutOfMemory exceptions rather than the cancel_stream error.

Clearly, increasing maxConnectionsPerDestination also requires more memory, which is understandable.

EDIT

Wait, I was looking at the wrong value; actually there is no improvement as we increase maxConnectionsPerDestination. Nothing!

@iamsanjay
Contributor

There should be a way to expose some sort of setting that users can control and experiment with. We do have maxConnectionsPerDestination, but that is used by the HTTP/1 connection, and its current value is 100000, which in my opinion would be too much for HTTP/2 while we are still experimenting with it.

We could also introduce another setting for HTTP/2 in UpdateShardHandler.
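
A minimal sketch of what such a knob could look like; the property name solr.http2.maxConnectionsPerDestination and the helper below are invented here for illustration and do not exist in Solr today:

import org.eclipse.jetty.client.HttpClient;

public final class Http2ClientLimits {
    private Http2ClientLimits() {}

    // Apply a hypothetical HTTP/2-specific per-destination connection cap,
    // read from a system property (default of 4 chosen arbitrarily for the sketch).
    public static void applyMaxConnectionsPerDestination(HttpClient httpClient) {
        int maxPerDest = Integer.getInteger("solr.http2.maxConnectionsPerDestination", 4);
        httpClient.setMaxConnectionsPerDestination(maxPerDest);
    }
}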

@iamsanjay
Contributor

Yes; our versions are in versions.props

I inquired about the Jetty version because I posted a question regarding the cancel_stream error on the Jetty forum, and they informed me that Jetty 10 is at end of support, and asked us to try Jetty 12 and report back if the problem still happens.

Note that Jetty 12 requires JDK 17!

@iamsanjay
Contributor

Benchmark               (maxConnectionPerHost)  (nodeCount)  (numReplicas)  (numShards)  (preGenerate)  (useSmallDocs)   Mode  Cnt     Score      Error  Units
CloudIndexing.indexDoc                       1            4              3            4          50000           false  thrpt    4  1843.723 ± 2943.271  ops/s
CloudIndexing.indexDoc                       2            4              3            4          50000           false  thrpt    4  1756.116 ± 2323.291  ops/s
CloudIndexing.indexDoc                       4            4              3            4          50000           false  thrpt    4  1693.267 ± 2690.380  ops/s
CloudIndexing.indexDoc                       8            4              3            4          50000           false  thrpt    4  1598.918 ± 2505.014  ops/s
CloudIndexing.indexDoc                      16            4              3            4          50000           false  thrpt    4  1597.113 ± 2100.961  ops/s
CloudIndexing.indexDoc                      32            4              3            4          50000           false  thrpt    4  1694.114 ± 2307.802  ops/s
CloudIndexing.indexDoc                      64            4              3            4          50000           false  thrpt    4  1622.504 ± 2663.477  ops/s
CloudIndexing.indexDoc                     128            4              3            4          50000           false  thrpt    4  1633.025 ± 2277.204  ops/s
CloudIndexing.indexDoc                     256            4              3            4          50000           false  thrpt    4  1514.132 ± 2218.589  ops/s
CloudIndexing.indexDoc                     512            4              3            4          50000           false  thrpt    4  1645.532 ± 2375.068  ops/s
CloudIndexing.indexDoc                    1024            4              3            4          50000           false  thrpt    4  1590.969 ± 2128.108  ops/s

@HoustonPutman
Contributor

Thanks for getting those numbers, very interesting that it had no effect.

@dsmiley
Contributor Author

dsmiley commented Sep 27, 2024

This is a low-priority thing; feel free to set it aside until we get to Jetty 12 and JDK 17. In the meantime, we know maxConnectionsPerDestination is misconfigured to be too low; it will be increased by default quite a bit -- Houston, I'm looking at you :-)
