QueuedThreadPool increased thread usage and no idle thread decay #4105
Comments
That stack trace shows a thread idle in the pool. How many cores are in your machine? 2300 seems too large a number, but it may be due to your application allocating HttpClient instances, perhaps? What happened on the 13th? You reverted to 9.4.7?
Are you using the default QueuedThreadPool, and with what configuration? If you are using a Java profiler such as Flight Recorder, recordings from before and after the update would be helpful.
Physical CPU cores: 32
I'm not aware that it allocates HttpClient instances. I'll investigate it further to be sure.
Yes, on the 13th I reverted back to 9.4.7.
Yes, I'm using the QueuedThreadPool with the following configuration:
I have Java Flight Recordings from before and after the Jetty update. Maybe they can be helpful.
Interesting?!?!? We have done some work on the QueuedThreadPool, but if anything we believe it should result in fewer idle threads rather than more. With so many idle threads and your configured timeout of 60s, I would expect the number of idle threads to slowly decay over time... but that doesn't appear to be the case. So something must be momentarily using all those threads at least once per minute, OR we have a bug where we prefer to create a new thread rather than reuse one that is idle (or just about to become idle).

I can't see anything particularly wrong with your QueuedThreadPool configuration (well, I don't like the bound on its size, as the failures you get with a bounded thread pool queue are no better or less random than out-of-memory exceptions... but at least you have a large bound). However, just to remove doubt about that, could you run with a configuration of:

```java
int minThreads = 8;
int maxThreads = 4096;
QueuedThreadPool threadPool = new QueuedThreadPool(maxThreads, minThreads, 60_000);
threadPool.setName("jetty");
threadPool.setDaemon(true);
```

Note that I have doubled maxThreads, because if the problem persists (as I suspect it will) I want to see if it goes well beyond 2048 - i.e. is 1912 some natural level, or was it just bouncing off maxThreads (hopefully you have the memory for that?). It would then also be good to run with the idle time reduced to 10s, just so we can see if there is any reduction and how frequent the peaks that create more idle threads are. Seeing a plot of thread count with 1-second resolution would be handy.
Could the problem be caused by the following scenario? In the QueuedThreadPool.ensureThreads() method, the loop that spawns new threads does not change the idle count itself. Instead it relies on each of the spawned threads to increase the idle count via addCounts. This introduces a race condition between the spawning loop and the Runner code that is executed in the threads. So depending on when the first threads are actually executed, that loop might already have started dozens or even hundreds of threads before enough of them have run to bring the count back above 0.
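A minimal sketch of the suspected race under the assumption described above (hypothetical code, not the actual Jetty source): the spawning loop keys its decision off a count that only the spawned threads themselves increment, so it can overshoot badly before any of them are scheduled.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of the race, not the real QueuedThreadPool code.
public class EnsureThreadsRaceSketch
{
    private final AtomicInteger idle = new AtomicInteger();
    private final AtomicInteger threads = new AtomicInteger();
    private final int maxThreads = 4096;

    // The loop decides based on the idle count, but only the spawned thread
    // itself increments it (the "addCounts" step), so the loop can start
    // many threads before the count catches up.
    void ensureThreads(int neededIdle)
    {
        while (idle.get() < neededIdle && threads.get() < maxThreads)
            startThread();
    }

    private void startThread()
    {
        threads.incrementAndGet();
        Thread runner = new Thread(() ->
        {
            idle.incrementAndGet(); // the count only rises once the thread runs
            // ... poll the job queue ...
        });
        runner.setDaemon(true);
        runner.start();
    }

    public static void main(String[] args) throws InterruptedException
    {
        EnsureThreadsRaceSketch pool = new EnsureThreadsRaceSketch();
        pool.ensureThreads(1); // ask for a single idle thread
        Thread.sleep(500);
        System.out.println("threads started: " + pool.threads.get()); // often far more than 1
    }
}
```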
I ran the tests with the suggested configuration:
The number of threads increased to 4400 (3636 were in the TIMED_WAITING state). I noticed there was a spike in active threads at process start. With an idle timeout of 1 min or 10s, the number of threads didn't decrease over time. But with an idle timeout of 1s, the number of threads fell over time to ~1700.
Looking at the "1s test run", it seems like each peak in the "Threads" graph is correlated with a peak of "Active Threads" in the "Webserver" graph. Out of curiosity, what is your configured selector count on this 64 core machine?
@gregw I think @AdrianFarmadin has a point in his analysis about the race. Thoughts?
@sbordet Agreed that looks like it could indeed be a problem.... pondering!
Referenced commit: …d in QTP (#4118)
* Issue #4105 - starting a thread in QTP now increments idle count
* Issue #4105 - improve comments in test
Signed-off-by: Lachlan Roberts <[email protected]>
Merged PR #4118 to address the idle count issue pointed out by @AdrianFarmadin.
I tested the fix from PR #4118 and there is an improvement. The thread count is now around 3700 and there is no spike in active threads on startup.
@AdrianFarmadin so if you set an aggressive idle timeout for threads (e.g. 5 seconds), do you see the thread count constrained to a lower number? The point being that if you have moderate activity on the server, every thread may be "touched" before it expires, so that a single spike may create a lot of threads that will then never expire - each will do a little work and pause for a long time, but not enough to be timed out. Setting a more aggressive idle timeout will reduce the likelihood of this situation.
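For reference, a sketch of how such an aggressive idle timeout could be configured on the pool (the 4096/8 sizes are just the values used earlier in this thread; adjust to your real setup):

```java
QueuedThreadPool threadPool = new QueuedThreadPool(4096, 8);
threadPool.setName("jetty");
threadPool.setDaemon(true);
threadPool.setIdleTimeout(5_000); // milliseconds; the default is 60_000
```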
Note that the default state of the lock that idle QTP threads wait on is not fair. The issue may be that on your JVM/OS the implementation of the lock is actually being fair (as it is allowed to be), and waking up the longest-waiting thread... thus effectively touching the most idle thread and preventing shrinkage.

If this is the case, there is not much we can do about it, other than introducing a slower thread pool that implements a stack of idle threads... or perhaps we can implement our own lock that is deliberately unfair and always wakes up the most recent waiter?

But for now, the main question we need to answer is: were all 3700 threads actually non-idle at the same time? Did you have enough load so that 3699 threads were busy and starting the 3700th was the correct thing to do?
Ignore me about unfair locks.
@AdrianFarmadin if you'd like to be experimental, perhaps you could try running with the following extension to QueuedThreadPool:

```java
import java.util.concurrent.TimeUnit;

import org.eclipse.jetty.util.thread.QueuedThreadPool;
import org.eclipse.jetty.util.thread.ReservedThreadExecutor;

public class ReservedQueuedThreadPool extends QueuedThreadPool
{
    public ReservedQueuedThreadPool(int maxThreads, int minThreads)
    {
        super(maxThreads, minThreads);
        setReservedThreads(maxThreads / 2);
    }

    @Override
    public void execute(Runnable job)
    {
        // Try to run the job on a reserved (recently active) thread first,
        // falling back to the normal job queue if none is available.
        if (!job.getClass().getSimpleName().equals("ReservedThread") && tryExecute(job))
            return;
        super.execute(job);
    }

    @Override
    public void setIdleTimeout(int idleTimeout)
    {
        // Keep the reserved executor's idle timeout in sync with the pool's.
        ReservedThreadExecutor reserved = getBean(ReservedThreadExecutor.class);
        if (reserved != null)
            reserved.setIdleTimeout(idleTimeout, TimeUnit.MILLISECONDS);
        super.setIdleTimeout(idleTimeout);
    }
}
```

The idea here is that "idle" threads will first become idle reserved threads, and only then normal idle threads waiting for jobs on the real job queue. Note the simpleName test and the half-maxThreads size are just hacks to test the concept of giving priority to recently active threads to help shrink the thread pool. It may also be beneficial because recently active threads are more likely to be running on CPU cores with hot code/data caches. @sbordet thoughts?
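If it helps, a rough sketch of wiring that experimental pool into an embedded server (assuming the 4096/8 sizes and the 5 second idle timeout discussed in this thread; not the author's exact setup):

```java
import org.eclipse.jetty.server.Server;

public class ReservedPoolServer
{
    public static void main(String[] args) throws Exception
    {
        ReservedQueuedThreadPool threadPool = new ReservedQueuedThreadPool(4096, 8);
        threadPool.setName("jetty");
        threadPool.setDaemon(true);
        threadPool.setIdleTimeout(5_000); // before start, the override just delegates to super

        // The thread pool must be passed to the Server constructor to be used.
        Server server = new Server(threadPool);
        // ... add connectors and handlers as in the real application ...
        server.start();
        server.join();
    }
}
```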
I did some more testing based on the comments above. All tests were done with the same load and hardware. The configurations tested were:
- Fix from PR #4118 + 5 seconds idle timeout for threads
- Fix from PR #4118 + 5 seconds idle timeout for threads + ReservedThreadExecutor

In version 9.4.7.v20170914 there were nearly always around 75 active threads. After the update to 9.4.20.v20190813 the number of active threads increased to 186, with some spikes up to nearly 1000. The fix from PR #4118 had no influence on the count of active threads. In the test for PR #4118 + 5 seconds idle timeout for threads + ReservedThreadExecutor the spikes in active threads were removed.
Thanks for the detailed info. @sbordet it is interesting that without the reserved thread executor the OP is still seeing spikes in active threads, but that the hack removes those spikes. That makes me think that on this machine/load, the QTP is still favouring new or long-term idle threads rather than recently active threads. I think there could be benefit in using reserved threads for normal dispatch to try to favour recently active threads. This can be done as per the code snippet above (extra reserved executor), or I have a branch where we can use the existing reserved threads on the QTP. I'll push that later today.
The spikes are removed only for the 5 seconds idle timeout. If I use a 1 minute idle timeout, then the spikes occur again and the active thread count is around 100. The spikes with the ReservedThreadExecutor are shown in the graph below.
(Graph: Test for PR #4118 + 1 minute idle timeout + ReservedThreadExecutor)
@AdrianFarmadin in your comment with the various configurations and their graphs, there is one where you say "Fix from PR #4118 + 5 seconds idle timeout for threads + ReservedThreadExecutor" and the graph shows about 800-1000 threads (third graph from the top). The 2 graphs are quite different. What am I missing? Do you have a reproducer that we can try ourselves?
@gregw I don't know. Having said that, @AdrianFarmadin how do you measure "Active Threads"? Is there any possibility that the value from the 2 different Jetty versions is not measuring the same quantity? Also, what kind of traffic do you have? I ask because in 9.4.9 we changed the way we create and destroy connections.
@sbordet I think the difference is that the first few graphs are for total threads (including idle), the last few graphs are for active threads only.
I tested the fix in #4146 over a longer timeframe and the results are still different from 9.4.7.v20170914. I also found out that the regression started in jetty-9.4.8.v20171121.
So the fix has remedied the idle thread problem, as we see idle thread decay for both timeout settings. The difference is not so much idle threads: from 9.4.8 there are spikes in active threads that are not seen in 9.4.7. The bug in 9.4.20 was that we were racing when creating new threads, but those were not active threads, and I don't think we see that bug in the active thread spikes.

So the question is - are these active thread spikes real or not? Are these spikes due to a change in the way we offer load to the QTP, or a change in the way the QTP handles load? If we have changed the QTP so that an active thread took more time or was excluded from becoming idle, then that could cause more threads to be created. Or perhaps we are just offering more load? Issues from 9.4.8 (a big release) that we need to examine include:
These look like changes that may affect QTP behaviour, the offered load, and the fairness/behaviour of contended locks on the QTP. We will need to research. @AdrianFarmadin can you confirm/deny whether you are seeing any quality-of-service changes? Are requests still being handled with similar latency? Is max throughput about the same? @sbordet I have reopened this issue for now... but I think this probably deserves a new issue... but I thought we'd do some research first before deciding.
Are there any built-in Jetty metrics I can use?
@AdrianFarmadin The built-in Jetty metrics are from the StatsHandler, but they will be measuring the latency within handlers and thus exclude some of the vital accepting/selecting/scheduling/flushing parts of QoS that really can only be measured externally. Unfortunately our own external CI load test results were only recorded permanently from 9.4.8.

@olamy can you run up our load tests at several rates of load so we can compare 9.4.7 and 9.4.8, just to make sure there is no obvious change in capacity. If possible, could you also measure idle threads and active threads during those tests - looking for any spikes that match what @AdrianFarmadin is seeing.

My current thinking is that the changes we made in 9.4.8 were a result of a client pushing the server to extreme loads, where we were suffering from live locks and unfairness issues at very high request rates. I believe we had to make some scheduling changes that may favour executing jobs over a single thread iterating on them. This could explain the spikes in used threads observed.

It would also be really valuable if you could collect information from the reserved thread pool as well. In 9.4.22 that is the JMX object org.eclipse.jetty.util.thread:type=reservedthreadexecutor,id=0, and looking at the available, capacity and pending threads there would be good information to have. Failing that... or perhaps anyway, I think we should write a special version of the QTP that detects spikes of new thread creation and records why, and who scheduled the job that caused it. That will tell us a lot about why you are seeing these spikes.
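For example, assuming Jetty's JMX support (an MBeanContainer registered with the platform MBean server) is enabled so those beans are visible, a rough sketch of reading the reserved thread executor attributes in-process could look like this (attribute names are taken from the comment above and may differ slightly between versions):

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ReservedThreadStats
{
    public static void main(String[] args) throws Exception
    {
        MBeanServer mbeans = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName(
            "org.eclipse.jetty.util.thread:type=reservedthreadexecutor,id=0");

        // Attribute names as referenced above; check the actual MBean if they differ.
        System.out.println("available: " + mbeans.getAttribute(name, "available"));
        System.out.println("capacity:  " + mbeans.getAttribute(name, "capacity"));
        System.out.println("pending:   " + mbeans.getAttribute(name, "pending"));
    }
}
```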
@AdrianFarmadin is your testing using persistent connections? One big difference we introduced in 9.4.8 was that we now execute the creation of new connections and the closing of old ones. Previously we did this in the selector thread, but that had issues as it can take some time... and for some protocols it can call user code. Thus it is plausible that the active thread spikes are related to connection spikes that were previously handled by the selector threads.

I think the next step is definitely an instrumented QTP to record who/why threads are started, and then we can confirm/deny this theory. This will take me a day or two. If it is this, then I do not believe it is a problem. Running more threads is still below the total memory commitment represented by max threads, and it will be giving you a fairer, less DoS-vulnerable server.
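A very rough sketch of what such an instrumented pool could look like (a hypothetical illustration, not the instrumented version mentioned above): override execute() and record a stack trace whenever a dispatch arrives while no thread is idle, since that is when a new thread is likely to be started.

```java
import org.eclipse.jetty.util.thread.QueuedThreadPool;

public class InstrumentedQueuedThreadPool extends QueuedThreadPool
{
    public InstrumentedQueuedThreadPool(int maxThreads, int minThreads, int idleTimeout)
    {
        super(maxThreads, minThreads, idleTimeout);
    }

    @Override
    public void execute(Runnable job)
    {
        // No idle thread available: this dispatch will probably start a new
        // thread, so record who scheduled the job and from where.
        if (getIdleThreads() == 0)
        {
            System.err.printf("possible thread spike: threads=%d job=%s%n", getThreads(), job);
            new Throwable("dispatch stack").printStackTrace();
        }
        super.execute(job);
    }
}
```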
@gregw tests are running now to collect data.
Yes, all connections are persistent.
@AdrianFarmadin Is this live traffic or unit testing? The mean request time is longer in 9.4.x at 333ms vs 259ms. Could that be natural variation in the load/processing, or is that something that is repeated on all runs? Another big difference is the max open connections at 28211 vs 17972, which means that at one point 9.4.x had to deal with 10k more open connections, which means 2*10k more dispatched jobs, which is plausibly the source of the active thread spikes. I think the total connections stat is wrong - it looks like the current connections. Is that an error in your stats collection (wrong JMX name), or is our stats collection not working for you?
What stands out is the number of concurrent dispatches, 1558 vs 180 and 555 vs 139. The server seems to run at ~4000 requests/s, and 9.4.x seems to be a lot more concurrent and with less latency than 9.4.7. That could explain the additional thread usage. @gregw thoughts?
We dispatch more, so we should expect more threads. |
Original issue description:
I recently updated my Jetty version from 9.4.7.v20170914 to 9.4.20.v20190813 and I noticed an increase in the JVM thread count from 410 to 2300. The update was done on a test environment with stable load.
According to a thread dump, there are 1912 threads in the TIMED_WAITING state.
The active threads were also affected.
Is this a regression in the new version?
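For anyone trying to reproduce this, one way to count threads by state in-process is via ThreadMXBean (just an illustration, not how the numbers above were collected):

```java
import java.lang.Thread.State;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateCount
{
    public static void main(String[] args)
    {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<State, Integer> counts = new EnumMap<>(State.class);
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds()))
        {
            if (info != null) // a thread may have died since its id was listed
                counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        counts.forEach((state, count) -> System.out.println(state + ": " + count));
    }
}
```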