Large amount of ram used by MBassador #1
All charts show consistent read performance (handler invocation). Write performance varies significantly (high std. deviation), which can be explained by unpredictability in thread scheduling around synchronized resources. Memory consumption is significantly lower for a limited queue size, but comes with a decrease in performance. Disruptor is considerably slower (explanation?). Its processing is not constant but shows hotspots (clusters of data points), and its memory consumption is significantly lower, which is partly due to lower throughput. @dorkbox Any explanation for the low performance of Disruptor? Maybe its model does not match the scenario, or it's not being used correctly. What about a custom ring buffer with an atomic index? I already built one once; it wasn't very difficult, and it gives a constant memory footprint with low synchronization overhead. Maybe an array blocking queue would be a viable alternative, too? Many thanks for your great work; it is really interesting to see those figures. I once used JProfiler to trace thread activity (running, waiting, blocked, ...) to compare MBassador with Guava. I never posted the results (which I should, I realize), but this could be helpful in figuring out why Disruptor is so much slower. It probably does work that is unnecessary (after all, there are not many different consumers/producers).
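The "custom ring buffer with an atomic index" idea might be sketched roughly like this: a hypothetical, minimal single-producer/single-consumer variant (all names are mine, not from any of the libraries discussed). A fixed-size array plus two atomic indices gives a constant memory footprint:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: fixed-size SPSC ring buffer with atomic indices.
// Capacity must be a power of two so (index & mask) can replace modulo.
// NOT safe for multiple producers or multiple consumers.
final class SpscRingBuffer<E> {
    private final Object[] buffer;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot to read
    private final AtomicLong tail = new AtomicLong(); // next slot to write

    SpscRingBuffer(int capacityPowerOfTwo) {
        buffer = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    boolean offer(E e) {
        long t = tail.get();
        if (t - head.get() == buffer.length) return false; // full
        buffer[(int) (t & mask)] = e;
        tail.lazySet(t + 1); // single writer: an ordered store suffices
        return true;
    }

    @SuppressWarnings("unchecked")
    E poll() {
        long h = head.get();
        if (h == tail.get()) return null; // empty
        int slot = (int) (h & mask);
        E e = (E) buffer[slot];
        buffer[slot] = null; // allow GC of the element, not the array
        head.lazySet(h + 1);
        return e;
    }
}
```

Because the backing array is allocated once, memory stays constant no matter how many messages flow through; the cost is that `offer` fails when consumers fall behind.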
I'm not sure why it's so slow; I think it has to do with queueing messages. I think the benchmark measures how fast messages can be queued, not necessarily how long it takes to execute each message. I'm experimenting with ring buffers and object pools to see how they affect performance. I must admit I'm rather disappointed in Disruptor's performance. I did read that the primary use for the Disruptor is handling (fixed-length) bytes of data. What was interesting was that having a larger ring buffer (i.e., anything larger than 16) also resulted in progressively worse performance, and it doesn't make any sense to me why; I'd think that a larger ring buffer would help queue entries.
Hmmm. Why is the Disruptor so slow in comparison? I was so intrigued by their idea, the paper, and the many interesting blog posts by Martin Thompson. There must be something wrong in the way it is used.
Indeed. Their idea, paper, blogs, videos, etc. looked really, really good. I think the correct way to use the Disruptor is only to hand off data, not to actually process it. I've banged my head against the wall for a while, and I have no idea what I could possibly be doing incorrectly, so I've moved on to something else. I'm using the MPMC queue from here: http://psy-lob-saw.blogspot.de/2015/01/mpmc-multi-multi-queue-vs-clq.html Moving to a "dispatch" queue and an "invoke" queue has really helped performance, and using limited queue sizes really helps keep the memory footprint down. For funsies, here's using a LinkedTransferQueue + MPMC queue.
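The "dispatch queue + invoke queue" split described above might look like the following sketch, using the JDK's bounded ArrayBlockingQueue as a stand-in for the MPMC queue (the class and method names are hypothetical, not taken from the actual fork):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical sketch of a two-stage pipeline: producers enqueue raw messages
// on a bounded "dispatch" queue; a dispatcher thread turns each message into an
// invocation task on a bounded "invoke" queue, which worker threads drain.
// Bounding both queues keeps the memory footprint constant.
final class TwoStageBus {
    private final BlockingQueue<Object> dispatchQueue = new ArrayBlockingQueue<>(1024);
    private final BlockingQueue<Runnable> invokeQueue = new ArrayBlockingQueue<>(1024);
    private final Consumer<Object> handler;

    TwoStageBus(Consumer<Object> handler) { this.handler = handler; }

    // producer side: hand the message off and return immediately
    boolean publish(Object message) { return dispatchQueue.offer(message); }

    // dispatcher thread: convert one pending message into an invocation task
    boolean dispatchOne() {
        Object msg = dispatchQueue.poll();
        if (msg == null) return false;
        return invokeQueue.offer(() -> handler.accept(msg));
    }

    // worker thread: actually run one handler invocation
    boolean invokeOne() {
        Runnable task = invokeQueue.poll();
        if (task == null) return false;
        task.run();
        return true;
    }
}
```

In a real bus the `dispatchOne`/`invokeOne` steps would run in loops on dedicated threads; they are exposed as single steps here only to make the hand-off visible.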
Wow! LinkedTransferQueue is outperforming the rest of the candidates by far in terms of throughput. So when you need a low memory footprint, you should take the MPMC queue, which will still give good performance (probably the best memory:throughput ratio). If memory is not so much of a concern, then LTQ is the best option. This is quite an insight; I will definitely bring it into the next release. As for the reflection part: I am still convinced that reflective invocation gets completely optimized away by HotSpot, and as it is JDK standard I think it is good to keep it the way it is. It is stable and well-performing. Do you have other code optimizations/refactorings, compatible with the current code base, that you would like to see in there? Maybe we could collaborate on this project and continue development in one place?
Based on the performance results, I'm also convinced there is no need for ReflectASM, at least for method access. For field access I have no idea. I'll play around with inflation on the JVM to see what that does, too. I'll work on getting the changes into a state compatible with your master. I did have to make some small changes to the testing framework (adding colors and memory usage, for example), so I'll clean that up and put in a pull request. It will be a few days, though, since I'm swamped with work.
More info on LTQ performance: http://php.sabscape.com/blog/?p=557
Turns out, I think I figured out the Disruptor. It is REALLY fast at handing off data. Really, REALLY fast. The downside is that if it does any work on that data, I have yet to see it go fast (as shown earlier). My test used the Disruptor to hand data off to an executor. Wow. The downside is that the LTQ it handed data off to grew to about 2 GB. Not what we want, but interesting.
I did some benchmarks (notably from here, because of how misinformed the answers online are). Reflective invocation (without setAccessible): 182.837 ns
Looks like another tweak that can easily be integrated. The good news is that it doesn't require any external library like ReflectASM. The only limitation is that it is available only in Java >= 7.
Maybe; the MethodHandle has to be "static final", which means it's tricky. I think ReflectASM is the only way to get it down that low; however, 1.7 ns is really, really fast. I'm currently working on some data structures to get object creation down, too; there's a lot going on. Here's the latest (not backported yet). The timing isn't quite what I want yet; what do you think? It turns out the memory usage isn't quite accurate (it's also measuring the test framework), so I've done some profiling in VisualVM to make improvements.
Looks like you are making good progress. How do we go about bringing this back into the core of MBassador? I am currently working on some minor tickets, mainly about configuration, error handling, and the like. I am also thinking about changing to LTQ and MethodHandle. Have you tweaked other parts?
I've done some more tweaks (faster iteration over collections) and I'm working on managing memory more efficiently (caching subscriptions for superclasses). The only way MethodHandle helps is if it's "static final" (so the JIT will inline the method call), which is impossible for dynamic method invocation. Reflection via "Method" (as it currently is) or ReflectASM are the only performant options from what I can tell. Also, I can make the change to LTQ (I've already got the source included, so it'll work on Java 6) if you'd like. I'll put in some pull requests in a few days; I'm finishing up some memory issues right now.
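For reference, here is a sketch of the two invocation strategies being compared (Java 7+; class and method names are mine, purely illustrative). The "static final" constraint discussed above is what lets the JIT treat the MethodHandle as a constant and inline through it; a handle held in an ordinary field does not get that treatment, which is why it does not help for handlers looked up at runtime:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Method;

// Illustrative comparison of constant-MethodHandle vs. reflective invocation.
final class InvokeDemo {
    public static String greet(String name) { return "hello " + name; }

    // A constant handle: eligible for full inlining by the JIT.
    static final MethodHandle GREET;
    static {
        try {
            GREET = MethodHandles.lookup().findStatic(
                    InvokeDemo.class, "greet",
                    MethodType.methodType(String.class, String.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static String viaHandle(String name) {
        try {
            // invokeExact requires the call-site signature to match exactly
            return (String) GREET.invokeExact(name);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    static String viaReflection(String name) {
        try {
            Method m = InvokeDemo.class.getMethod("greet", String.class);
            return (String) m.invoke(null, name);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Both paths produce the same result; only the constant-handle path gives the JIT enough information to inline the target.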
Cool. Looking forward to that code. Can you please try to make multiple small PRs? That will help me review the code and understand the changes and their implications. Thanks!
I'll have it in a lot of commits so you can follow it. Unfortunately, a PR is for the repo (not a commit), so there can't be separate PRs for each commit (you just get the whole thing). :/
The PR is for a specific branch; you could make intermediate branches. Also, the code is quite stable now. There have been no issues with the core.
|
Of course! I'll put it into separate branches, no problem at all. I agree, keeping it stable is very important. |
@dorkbox How is this going? We're looking forward to the mbassador improvements, whether via the queue or the Disruptor approach, and to more tests comparing it against Guava or RRiBbit as event bus solutions.
Been super busy with work, but I've been working on ways to improve the queue/executor: a zero-GC executor with good performance, a cross between LTQ, Disruptor, and Exchanger. Concurrency is really, really hard, but I'm making solid progress.
@dorkbox Sounds exciting! I am also quite busy with work at the moment, so I will not have much time to spend on mbassador in the next few weeks. But I will have a look at your work as soon as you are done, that's for sure. Your participation is greatly appreciated.
@dorkbox How is your progress on the LTQ/Disruptor/Exchanger cross you mentioned?
@dorkbox Any update on this issue, or any idea when your work will be merged into an mbassador branch?
Sorry for taking so long. I've found some other areas to improve RAM/performance during my local tests (it deals with iterating over certain collections); I'll be adding those back to the main project (and discussing them there). I'm still testing the executor, and I'll post it as soon as I finish. I'm doing computer-science masters-level work here (there are surprisingly few, but really good, papers on this topic) and it's really hard.
After many months of research and late nights, I've finished the blocking queue. Its heap allocation is constant (it does not change during runtime) and consequently has zero GC; it also scales rather well. The brains of the algorithm are from the EXCELLENT (and ridiculously fast) MPMC queue written by Nitsan. I re-wrote his MPMC queue to also be a blocking queue (similar to how the LinkedTransferQueue operates). I called it the MpmcTransferArrayQueue (MTAQ), as it is based on the MpmcArrayQueue.

I'm still cleaning up/attributing the code, and I'll have it available on GitHub (in my own project), as well as part of MBassador (if approved/wanted). The performance and memory benefit on my very simplified and structurally changed fork of MBassador was noticeable. For MBassador (master) the effects were smaller: there is a slight performance improvement, but the more noticeable improvement is in consumed memory. There are also different ways for MBassador to handle/dispatch messages, so that could vary quite a bit.

The following is a performance breakdown comparing the LinkedTransferQueue (the latest version is generally accepted as one of the fastest blocking queues in the Java collections) and the MTAQ, each running in different modes with 2 consumers/2 producers or 4 consumers/4 producers. I did not benchmark other configurations, since single-consumer/single-producer queues would use an entirely different data structure, and most consumer-grade hardware has fewer than 8 cores. The following tests were run on an i7-4700HQ CPU @ 2.40GHz with 16 GB of RAM, running Linux 3.13.0-49-generic, x86_64.
Given the synchrony of MBassador, I didn't notice an improvement in performance (the dispatch times are only slightly closer to each other, and likely statistically insignificant), but less RAM is used since the queue no longer allocates per element. LinkedBlockingQueue (current master, not LTQ) (edit: added LinkedBlockingQueue to the performance chart)
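The general shape of the MTAQ approach described above — layering blocking behavior on top of a fast non-blocking queue — might look like this sketch. ConcurrentLinkedQueue stands in for the JCTools MpmcArrayQueue, and a spin-then-park loop replaces the real implementation's park/unpark hand-off; all names here are hypothetical:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.LockSupport;

// Hypothetical sketch: turn a non-blocking MPMC queue into a blocking one
// without locks. A consumer spins briefly on the fast path, then backs off
// with a timed park so an empty queue does not burn a whole core.
final class BlockingWrapper<E> {
    private final Queue<E> queue = new ConcurrentLinkedQueue<>();

    void put(E e) { queue.offer(e); }

    E take() {
        int spins = 0;
        for (;;) {
            E e = queue.poll();
            if (e != null) return e;
            if (++spins < 1000) {
                Thread.yield();              // brief busy-wait on the fast path
            } else {
                LockSupport.parkNanos(1_000); // timed back-off; no unpark needed
            }
        }
    }
}
```

The real MTAQ is array-based (constant heap) and uses an explicit producer/consumer hand-off rather than timed parking, but the layering idea — fast lock-free path first, blocking only as a fallback — is the same.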
Wow, you have been making quite some progress on this topic. Did I understand correctly that you are doing this work in the context of university studies? I think you are doing a really good job here. What are the dependencies of the MTAQ code? Do you use Java classes from JDK 7, or is it Java 6 compatible? If not, I think it would be nice to provide it as an extension to the core. It would be great to have your work become part of the mbassador project. And can you link the papers you mentioned?
@dorkbox Any update on this issue?
Yes. Going to be submitting code later today.
@dorkbox Any update on this issue?
@dorkbox Thanks for following up on this issue and making further improvements to mbassador.
There are a few slight improvements still to come (once I finish the MTAQ review for JCTools), so I'm leaving this issue open until that is complete.
Thanks for the review. The new mbassador release with bennidi/mbassador#125 brought it in, so I'm looking forward to your MTAQ modification. On another note, I want to integrate the Axon dispatch CommandBus with mbassador, and I think mbassador could see a huge performance gain once the MTAQ is merged.
@dorkbox Any update on this issue?
I'm busy finishing a fork/fix/rewrite of the Universal TweenEngine (https://github.com/dorkbox/TweenEngine), as it had some nasty bugs coupled with a really complicated state machine, and I don't want to context switch until it's done (which hopefully is soon). It's tricky, especially concerning things like GC, reducing memory usage, and trimming unnecessary calculations.
@bwzhang2011 I have updated the pull request with JCTools, and once it's merged I will finish the implementation of the MTAQ for mbassador.
@dorkbox Thanks a lot for your great efforts on performance improvement for mbassador; I will continue to use it in my project. What's more, it would be good to take distribution into consideration; maybe in the future @bennidi could make that a direction for the project. As JCTools brings in some IPC support, maybe mbassador could develop another project around that.
@dorkbox Anyway, I am interested in seeing the performance of mbassador increase, as it's the core of our development here. A big thanks to @bennidi so far.
@roeltje25 I wouldn't be concerned about mbassador, as it is WAY better than other known message bus implementations. It's stable and incredibly simple (as these things go), which is why @bennidi doesn't need to actively develop it. I should also mention: mbassador is very fast on its own, and the only technique I could discover to dramatically improve its performance was to strip out a lot of features, almost entirely to do with object creation. The performance improvement that the MTAQ brings is that it is based on insanely fast queues (from JCTools) and removes object creation. The backbone of (pretty much all) thread executors is a queue, and the MTAQ addresses this. I have a fork that I'm using internally, which is a stripped-down version of mbassador, but I wouldn't use it yet; I'm waiting on the final version of JCTools and then another round of regression testing before recommending it in production systems.
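The point about queues being the backbone of thread executors is visible directly in the JDK: a ThreadPoolExecutor is parameterized by the BlockingQueue that feeds its workers, so swapping in a faster or bounded queue changes the executor's throughput and memory profile without touching the worker logic (a sketch; the helper name is mine):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// The work queue is a constructor parameter of ThreadPoolExecutor: replacing
// the default unbounded queue with a bounded (or otherwise faster) one is all
// it takes to change the executor's hand-off behavior and memory ceiling.
final class ExecutorBackbone {
    static ThreadPoolExecutor boundedPool(int threads, int queueSize) {
        return new ThreadPoolExecutor(
                threads, threads,                 // fixed-size pool
                0L, TimeUnit.MILLISECONDS,        // no keep-alive for core threads
                new ArrayBlockingQueue<>(queueSize)); // the queue IS the hand-off
    }
}
```

This is presumably why a faster queue like the MTAQ translates directly into a faster executor.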
@dorkbox @roeltje25 indeed I have been slow to review the PR :-( I will make it a priority |
@nitsanw Any update on this issue, especially on the PR from dorkbox?
@bwzhang2011 I have reviewed the PR, see dialogue here: JCTools/JCTools#68 |
@nitsanw is correct, and I'm currently investigating what to do for the purposes of mbassador. I should have an update on this issue soon, as my work schedule permits. |
@dorkbox Speaking for myself, I much appreciate your attitude, your great work, and your experimental efforts on this data structure, and I hope your implementation can be integrated with mbassador whether or not it is fully accepted into JCTools.
The architecture of mbassador just won't work with some of the enhancements that I have identified. As a result, I have forked (and changed some functionality of) mbassador to accomplish this. It's a much simpler version of mbassador, but it is a bit faster as a result of fewer features along with my enhancements. I would say it is more of a "pure" pub/sub message bus, with no frills (sorting, priority, filters, etc.); it's just subscribe and publish with high-performance async publication. On that topic, and a rather important note: the use of the Disruptor for async publication is significantly faster than anything else. Synchronous publication is a little bit faster, and outside of benchmarks I don't think it would be noticeable. All is not lost, as I will be issuing a PR adding limited, but appropriate, enhancements to mbassador. Just to make it extremely clear: MBassador is already really fast, and there's just not a whole lot to make faster. Specifically, this PR will be implementing the
I will attach benchmarks in the next two posts.
This is synchronous publication. The first is MBassador, the second is my fork. |
This is asynchronous publication. The first is MBassador, the second is my fork. |
For more detailed and extensive tests, see my Benchmarks project. These tests aren't meant to describe "real world" performance, but to derive comparisons between different implementations, and because of
@dorkbox Fair attribution: the single-writer-principle is a @mjpt777 term I have used, but not originated. See the blog post here: http://mechanical-sympathy.blogspot.co.za/2011/09/single-writer-principle.html |
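A minimal sketch of the single-writer principle from that post, as it might apply to a subscription registry (hypothetical names; this is not MBassador's actual code): all mutations are funneled through one thread, so the write path needs no locks or CAS retry loops, and readers only perform a volatile read of an immutable snapshot:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the single-writer principle: exactly ONE thread ever
// mutates the map, so writes never contend. Readers see a consistent,
// immutable snapshot published through a volatile field.
final class SingleWriterRegistry {
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    private volatile Map<String, String> snapshot = new HashMap<>();

    // all mutations are funneled through the single writer thread
    void put(String key, String value) {
        writer.submit(() -> {
            Map<String, String> next = new HashMap<>(snapshot); // copy-on-write
            next.put(key, value);
            snapshot = next; // volatile publish of the new snapshot
        });
    }

    // readers never contend: just a volatile read of the current snapshot
    String get(String key) { return snapshot.get(key); }

    // block until all previously submitted writes have been applied (demo aid)
    void quiesce() {
        try {
            writer.submit(() -> { }).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    void shutdown() { writer.shutdown(); }
}
```

The trade-off is write latency (mutations are asynchronous) in exchange for an uncontended, branch-free read path — which matches the dispatch-heavy workload of a message bus.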
@dorkbox I see that you managed to gain around 25% in synchronous dispatch performance (MBassador ~250 ms and your fork ~180 ms for two million handlers). That is impressive; I would not have expected this margin to be available :) But as you said, you had to remove some features that I would consider an important part of the library. Anyway, it's great to see that your efforts were rewarded. I will gladly include a reference to your project in MBassador's main readme. As for the graphs about async dispatch, I don't think I am able to interpret them correctly, as they look completely different. What do you get out of them? I would also be very happy to hear a brief summary of your learnings with the Disruptor. If I remember correctly, you were struggling in the beginning to make it work. What were you doing wrong? Thanks again for all your work.
@bennidi You're very welcome, and thank you for the mention. This has been an interesting journey, and I have learned an incredible amount about concurrent programming, what works and what doesn't. Also, I'm really happy with my results; it feels great to claim <100 ns per message dispatched. The major performance contribution was applying the
The other optimizations (in order of how much they enhanced performance) were: removing as many branch conditions as possible, which had the side effect of removing a bunch of features; using faster collections (the Kryo IdentityMap instead of HashMap); and modifying the use of your ConcurrentSet iterators (Strong and Weak) to not generate objects (here and here). Removing object generation was a goal of mine, but I'm not convinced that change had much of an impact on performance... if the charts were a straight line, it would be easier to measure. You'll see that anywhere there is concurrent access to a collection, the
Those graphs are a bit tricky for me to interpret as well. The best I can do to explain the differences: since MBassador queues all of the publications (which is what originally started me down this "rabbit hole"), the tests finish running with little or nothing actually getting published. The difference is that the Disruptor is fast enough to "stay on top" of processing the publications.
Yes, I was really, really struggling with the Disruptor. That was a year ago, so my memory is a little fuzzy on the details; I will explain how I got it to work in round 2. I don't know the exact problem a year ago, but I tried to modify the example and use it "all at once", and it failed to perform; then again on my first retry a few months ago, I modified it all at once, and it failed to perform. My success came when I took a performance-test example and very slowly adapted it to my own use, benchmarking after each modification. The best I can say is that the Disruptor is extremely sensitive to all of its parameters, and any changes made that differ from the examples have to be benchmarked to make sure those changes didn't break anything in the process.
@dorkbox would a single writer Identity/Hash Map/Set help here? |
@nitsanw Currently the |
Great job. |
@dorkbox I mean that a single thread updates the map/set. Multi- vs. single-reader makes no odds in the map/set case.
Here are some initial tests. I also included JVM memory usage to help understand the data. I don't think I have the Disruptor completely worked out, as it seems really, really slow. One thing I did notice is that the default LBQ setting in MBassador used up to 1 GB of RAM. Changing it to an LBQ of size 16 dropped that to about 512 MB without affecting performance too much. The Disruptor with a ring buffer of size 16 uses ~62 MB. I suspect pooling objects would help.
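The 1 GB vs. 512 MB difference above comes down to queue bounds. A quick sketch of the behavioral difference between the default (effectively unbounded) LinkedBlockingQueue and a bounded one (the demo class name is mine, for illustration only):

```java
import java.util.concurrent.LinkedBlockingQueue;

// An unbounded LinkedBlockingQueue (capacity Integer.MAX_VALUE by default)
// lets pending messages pile up without limit; a small bound caps pending
// work, at the cost of offer() failing (or put() blocking) once the queue is
// full and consumers have fallen behind.
final class QueueBoundsDemo {
    static int acceptedInto(LinkedBlockingQueue<Integer> q, int attempts) {
        int accepted = 0;
        for (int i = 0; i < attempts; i++) {
            if (q.offer(i)) accepted++; // non-blocking: returns false when full
        }
        return accepted;
    }
}
```

With a bound of 16, only the first 16 of 100 offers are accepted; the unbounded queue accepts all 100 — which is exactly where the RAM goes when publication outpaces handling.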
Linked Blocking Queue (size:Integer.MAX_VALUE)
![LBQ-intmax-chart](https://cloud.githubusercontent.com/assets/5301521/6149125/d20f15ec-b206-11e4-9526-5b8460f9e72e.jpg)
Linked Blocking Queue (size:16)
![LBQ-16-chart](https://cloud.githubusercontent.com/assets/5301521/6149128/d750f3b8-b206-11e4-8ea1-09d9e4365c80.jpg)
Disruptor (2 worker) + ReflectASM
![Disruptor-2-worker-pool-reflectasm-chart](https://cloud.githubusercontent.com/assets/5301521/6149140/e7c2cc76-b206-11e4-9716-9f1b10095a24.jpg)