During load testing to determine achievable packet rates on new server configurations, we monitor the stats reported at /debug/stats/jvb/transit-stats.
After we exceed the achievable packet rate for the server's capability, the average_delay reduces very slowly afterward. This is because the statistics are not calculated over a window, so the average includes every datapoint since the JVB was launched.
This limits the usefulness of this (otherwise hugely useful) statistic. Overall RTP & RTCP delay is probably the single statistic that captures bridge performance better than any other, but unless it is calculated over a much shorter time window than "all time" it has limited usefulness for monitoring bridge performance, since a short spike in delay can continue to affect the value for a long time.
For example, one day after a fairly short load test in our lab during which the server's capacity was deliberately exceeded:
Actual RTP packet delay has returned to much less than 1ms, now that the server is unloaded, but there is no way to see that from these stats, because the values are still dominated by the data points recorded during the load test and relatively few data points have been recorded since.
Current behavior
Transit stats are calculated based on every data point since the JVB was started.
Expected Behavior
Transit stats should be calculated over a shorter time window ending at the present time.
Possible Solution
Implement windowing in org.jitsi.utils.stats.BucketStats. (Note: although that class is in a different repository than this issue, I am filing the issue here because JVB is where the lack of windowing has the noticeable impact.)
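One common way to implement such windowing (a sketch only — this does not reflect how BucketStats is actually structured, and all names here are invented for illustration) is to keep a ring of per-interval accumulators and evict the oldest interval as time advances, so reported averages cover only the last N seconds:

```python
import time

class WindowedDelayStats:
    """Sliding-window delay stats built from a ring of per-interval slots.

    The window covers num_slots * slot_seconds; as time advances, the
    oldest slot is discarded, so the average reflects only recent
    traffic instead of every data point since startup.
    (Illustrative sketch only; names do not match BucketStats.)
    """

    def __init__(self, num_slots=60, slot_seconds=1.0, clock=time.monotonic):
        self.num_slots = num_slots
        self.slot_seconds = slot_seconds
        self.clock = clock
        # Each slot holds (count, total_delay_ms, max_delay_ms).
        self.slots = [(0, 0.0, 0.0)] * num_slots
        self.current_index = 0
        self.current_slot_start = clock()

    def _rotate(self):
        """Advance to the current time slot, clearing expired slots."""
        now = self.clock()
        while now - self.current_slot_start >= self.slot_seconds:
            self.current_index = (self.current_index + 1) % self.num_slots
            self.slots[self.current_index] = (0, 0.0, 0.0)  # evict oldest
            self.current_slot_start += self.slot_seconds

    def add_delay(self, delay_ms):
        self._rotate()
        count, total, mx = self.slots[self.current_index]
        self.slots[self.current_index] = (
            count + 1, total + delay_ms, max(mx, delay_ms))

    def average_delay(self):
        self._rotate()
        count = sum(c for c, _, _ in self.slots)
        total = sum(t for _, t, _ in self.slots)
        return total / count if count else 0.0
```

With this shape, a delay spike ages out of the reported average within num_slots * slot_seconds rather than lingering for days. The trade-off is slightly more memory (one accumulator per slot) and the need to pick a window length appropriate for monitoring.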
Steps to reproduce
While monitoring the values reported at /debug/stats/jvb/transit-stats, add traffic to the bridge until you exceed the server's performance capability. Remove the traffic and then observe that the transit stats take a long time to normalise. (If you overload the server by a large margin or for a long time, and the server is otherwise lightly loaded, they may not normalise for days.)
Hey @jbg, I understand your use-case. We solve the problem with some glue between the bridge and the database: we query transit-stats periodically (once a minute in our case) and subtract the values from the previous run.
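That glue can be sketched roughly as follows (the endpoint URL and the payload field names are assumptions for illustration; note that an average cannot be subtracted directly — you subtract the underlying sums and counts and then divide):

```python
import json
import urllib.request

def fetch_transit_stats(
        url="http://localhost:8080/debug/stats/jvb/transit-stats"):
    """Fetch the cumulative transit stats JSON from the bridge.

    The URL and payload shape are assumptions for illustration.
    """
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def delta(current, previous):
    """Per-interval counters: subtract the previous scrape's cumulative
    values, leaving only what was recorded since the last run.
    Only numeric fields are differenced; a field absent from the
    previous scrape is treated as starting from zero."""
    return {k: current[k] - previous.get(k, 0)
            for k in current
            if isinstance(current[k], (int, float))}
```

A periodic job (e.g. once a minute) would call fetch_transit_stats(), compute delta() against the previous scrape, and store the result, giving per-minute stats instead of all-time cumulative ones.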
Correct, we found max and average not to be very useful. We ended up extracting % of packets delayed more than X ms from the buckets for our monitoring (for X = 5, 50, 500) and this is what we graph.
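Deriving the percentage of packets delayed more than X ms from bucketed counts might look like this (the bucket boundaries and layout below are made up for illustration; they do not necessarily match the buckets transit-stats reports):

```python
def percent_over(buckets, threshold_ms):
    """Percentage of packets whose delay exceeded threshold_ms, given
    (upper_bound_ms, count) histogram buckets.

    Buckets whose upper bound is <= threshold_ms count as 'fast';
    everything else, including the overflow bucket (upper bound None),
    counts as delayed. Bucket layout is an assumption.
    """
    total = sum(count for _, count in buckets)
    if total == 0:
        return 0.0
    delayed = sum(count for upper, count in buckets
                  if upper is None or upper > threshold_ms)
    return 100.0 * delayed / total

# Hypothetical histogram: delays up to 2ms, 5ms, 50ms, 500ms, overflow.
buckets = [(2, 900), (5, 60), (50, 30), (500, 8), (None, 2)]
for x in (5, 50, 500):
    print(f"> {x} ms: {percent_over(buckets, x):.1f}%")
# → > 5 ms: 4.0%
#   > 50 ms: 1.0%
#   > 500 ms: 0.2%
```

Graphing these three percentages (for X = 5, 50, 500) gives a much clearer picture of current bridge health than an all-time max or average.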
Thanks! We'll set up something similar for our metrics.
How expensive is the transit-stats calculation? The max and average (if windowed) could be nice to expose more 'publicly' on the health endpoint or similar. It would make things easier for people who just want a simple metric to scrape & graph.