Large amount of ram used by MBassador #1

Closed
dorkbox opened this issue Feb 11, 2015 · 81 comments


dorkbox commented Feb 11, 2015

Here are some initial tests. I also included JVM memory usage to help understand the data. I don't think I have the Disruptor completely worked out, as it seems really, really slow. One thing I did notice is that the default LBQ setting in MBassador used up to 1 GB of RAM. Changing it to an LBQ of size 16 dropped that to about 512 MB without affecting performance too much. The Disruptor with a ring buffer of size 16 uses ~62 MB. I suspect pooling objects would help.
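For clarity, the difference in plain Java (illustrative only; MBassador actually wires its queue up through its own configuration):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class QueueSizes {
    // Default: effectively unbounded (capacity Integer.MAX_VALUE), so pending
    // messages pile up on the heap -- the ~1 GB case in the first chart.
    static final BlockingQueue<Runnable> UNBOUNDED = new LinkedBlockingQueue<Runnable>();

    // Bounded to 16: producers block once 16 messages are pending, which is
    // what capped memory at ~512 MB without hurting throughput much.
    static final BlockingQueue<Runnable> BOUNDED = new LinkedBlockingQueue<Runnable>(16);
}
```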

Linked Blocking Queue (size: Integer.MAX_VALUE)
[chart: LBQ-intmax]

Linked Blocking Queue (size: 16)
[chart: LBQ-16]

Disruptor (2 workers) + ReflectASM
[chart: Disruptor-2-worker-pool-reflectasm]


bennidi commented Feb 11, 2015

All charts show consistent read performance (handler invocation). Write performance varies significantly (high std. deviation), which can be explained by unpredictable thread scheduling around synchronized resources.

Memory consumption is significantly lower for the limited queue size, but comes with a decrease in performance.

The Disruptor is considerably slower (explanation?). Processing is not constant but shows hotspots (clusters of data points). Memory consumption is significantly lower, which is partly due to lower throughput.

@dorkbox Any explanations for the low performance of the Disruptor? Maybe its model does not match the scenario?! Or it's not correctly used. What about a custom ring buffer with an atomic index? I already built that once; it wasn't very difficult, and it gives a constant memory footprint with low synchronization overhead. Maybe an ArrayBlockingQueue would be a viable alternative, too?!
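To sketch what I mean by a ring buffer with an atomic index (single-producer/single-consumer only; an MPMC variant additionally needs CAS loops on both indices):

```java
import java.util.concurrent.atomic.AtomicLong;

// Capacity must be a power of two so the index mask works.
public class AtomicRingBuffer<E> {
    private final Object[] buffer;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot to read
    private final AtomicLong tail = new AtomicLong(); // next slot to write

    public AtomicRingBuffer(int capacityPowerOfTwo) {
        this.buffer = new Object[capacityPowerOfTwo];
        this.mask = capacityPowerOfTwo - 1;
    }

    public boolean offer(E e) {
        long t = tail.get();
        if (t - head.get() == buffer.length) return false; // full
        buffer[(int) (t & mask)] = e;
        tail.lazySet(t + 1);                               // publish the write (release)
        return true;
    }

    @SuppressWarnings("unchecked")
    public E poll() {
        long h = head.get();
        if (h == tail.get()) return null;                  // empty
        E e = (E) buffer[(int) (h & mask)];
        buffer[(int) (h & mask)] = null;                   // free the slot for reuse
        head.lazySet(h + 1);
        return e;
    }
}
```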

And many thanks for your great work. It is really interesting to see those figures. I once used JProfiler to trace thread activity (running, waiting, blocked...) to compare MBassador with Guava. I never posted the results (which I should, I realize), but they could be helpful in figuring out why the Disruptor is so much slower. It probably does unnecessary work (after all, there are not many different consumers/producers).


dorkbox commented Feb 11, 2015

I'm not sure why it's so slow -- I think it has to do with queueing messages. I suspect the benchmark measures how fast messages can be queued, not necessarily how long it takes to execute each message.

I'm experimenting with ring buffers and object pools to see how that affects performance. I must admit I'm rather disappointed in the Disruptor's performance. I did read that the primary use for the Disruptor is handling (fixed-length) byte data.

What was interesting was that a larger ring buffer (i.e., anything larger than 16) resulted in progressively worse performance -- and it doesn't make any sense to me why. You'd think that a larger ring buffer would help queue entries.
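For reference, my setup is roughly shaped like this (simplified, not the exact benchmark code; the event and handler are made up):

```java
import com.lmax.disruptor.EventFactory;
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.EventTranslator;
import com.lmax.disruptor.dsl.Disruptor;
import java.util.concurrent.Executors;

public class DisruptorSketch {
    // Every ring slot is pre-allocated once; publishers overwrite slots in place.
    static class MessageEvent { Object payload; }

    public static void main(String[] args) {
        Disruptor<MessageEvent> disruptor = new Disruptor<MessageEvent>(
                new EventFactory<MessageEvent>() {
                    public MessageEvent newInstance() { return new MessageEvent(); }
                },
                16,                                   // ring size: larger was *slower* in my tests
                Executors.newCachedThreadPool());

        disruptor.handleEventsWith(new EventHandler<MessageEvent>() {
            public void onEvent(MessageEvent event, long sequence, boolean endOfBatch) {
                // handler invocation happens here -- doing real work here is
                // exactly where I see the slowdown
            }
        });
        disruptor.start();

        disruptor.getRingBuffer().publishEvent(new EventTranslator<MessageEvent>() {
            public void translateTo(MessageEvent event, long sequence) {
                event.payload = "message";
            }
        });
    }
}
```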


dorkbox commented Feb 11, 2015

Here's a Multiple Producer, Multiple Consumer (MPMC) queue instead of the Disruptor. It uses a ring buffer, as you can see from the memory consumption. One of the things I'm looking at is having two queues: one for dispatch, and another for invocation (similar to what you already have) -- see the sketch after the charts.

This is just a single queue + invocation:
[chart: mpmcqueue-old-dispatch]

This disables invocation and just times the queues (LBQ + MPMC queue):
[chart: dispatch-invoke-only]
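The two-queue idea, heavily simplified (names made up):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class TwoStageBus {
    private final BlockingQueue<Object> dispatchQueue = new LinkedBlockingQueue<Object>(16);
    private final BlockingQueue<Runnable> invokeQueue = new LinkedBlockingQueue<Runnable>(16);

    // Stage 1: pull a raw message, resolve subscribers, queue the invocation work.
    void dispatchLoop() throws InterruptedException {
        while (true) {
            final Object message = dispatchQueue.take();
            invokeQueue.put(new Runnable() {
                public void run() { invokeHandlers(message); } // hand off; don't invoke here
            });
        }
    }

    // Stage 2: worker threads drain the invocation queue.
    void invokeLoop() throws InterruptedException {
        while (true) {
            invokeQueue.take().run();
        }
    }

    void invokeHandlers(Object message) { /* reflective handler calls */ }
}
```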


bennidi commented Feb 11, 2015

Hmmm. Why is the Disruptor so slow in comparison? I was so intrigued by their idea, the paper, and the many interesting blog posts by Martin Thompson. There must be something wrong in the way it is used.
But I think the performance of the single queue with ring buffer is quite impressive: about a million invocations per second. Are the handlers synchronous, and do you really measure all their invocations? That would be quite a nice result, especially considering the low memory footprint. What queue implementation are you using?


dorkbox commented Feb 12, 2015

Indeed. Their idea, paper, blogs, videos, etc. looked really, really good. I think the correct way to use the Disruptor is only to hand off data, not to actually process it. I've banged my head against the wall for a while, and I have no idea what I could possibly be doing incorrectly -- so I've moved on to something else.

I'm using the MPMC queue from here: http://psy-lob-saw.blogspot.de/2015/01/mpmc-multi-multi-queue-vs-clq.html

Moving to a "dispatch" queue and an "invoke" queue has really helped performance, and using limited queue sizes really helps keep the memory footprint down.
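Basic usage of that queue, for reference -- it's non-blocking, so the caller decides how to back off:

```java
import org.jctools.queues.MpmcArrayQueue;

public class MpmcDemo {
    public static void main(String[] args) {
        // Array-backed and fixed-capacity: no per-element node allocation.
        MpmcArrayQueue<String> q = new MpmcArrayQueue<String>(16);

        boolean queued = q.offer("message"); // false when full -- caller decides how to wait
        String msg = q.poll();               // null when empty -- non-blocking
        System.out.println(queued + " / " + msg);
    }
}
```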

For funsies, here's using a LinkedTransferQueue + MPMC queue.

[chart: 2-transfer]


dorkbox commented Feb 12, 2015

And here is JUST a LinkedTransferQueue (I'm starting to really like it, too).

[chart: only-ltq]
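What makes LTQ appealing for dispatch is transfer(): it blocks until a consumer actually receives the element, instead of just enqueueing it. A minimal demo:

```java
import java.util.concurrent.LinkedTransferQueue;

public class LtqDemo {
    public static void main(String[] args) throws InterruptedException {
        final LinkedTransferQueue<String> q = new LinkedTransferQueue<String>();

        Thread consumer = new Thread(new Runnable() {
            public void run() {
                try {
                    System.out.println("got: " + q.take());
                } catch (InterruptedException ignored) { }
            }
        });
        consumer.start();

        q.transfer("message"); // returns only once the consumer has received it
        consumer.join();
    }
}
```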


bennidi commented Feb 12, 2015

Wow! LinkedTransferQueue is outperforming the rest of the candidates by far in terms of throughput. So, when you need a low memory footprint you should take the MPMC queue, which will still give good performance (probably the best memory:throughput ratio). If memory is not so much of a concern, then LTQ is the best option.

This is quite an insight. I will definitely bring that into the next release. As for the reflection part: I am still convinced that reflective invocation gets completely optimized away in hot code paths, and as it is JDK standard I think it is good to keep it the way it is. It is stable and well-performing. Do you have other code optimizations/refactorings that are compatible with the current code base that you would like to see in there? Maybe we could collaborate on this project and continue development in one place?


dorkbox commented Feb 12, 2015

Based on the performance results, I'm also convinced there is no need for ReflectASM -- at least for method access. For field access I have no idea. I'll play around with reflection inflation on the JVM to see what that does, too.

I'll work on getting the changes into a state compatible with your master. I did have to make some small changes to the testing framework (adding colors and memory usage, for example), so I'll clean that up and put in a pull request. It will be a few days, though, since I'm swamped with work.


dorkbox commented Feb 12, 2015

More info on LTQ performance: http://php.sabscape.com/blog/?p=557


dorkbox commented Feb 12, 2015

Turns out, I think I figured out the Disruptor. It is REALLY fast at handing off data. Really, REALLY fast. The downside is that if it does any work on that data, I have yet to see it go fast (as shown earlier).

My test used the Disruptor to hand data off to an executor. Wow. The downside is that the LTQ it handed data off to grew to about 2 GB. Not what we want, but interesting.


dorkbox commented Feb 20, 2015

I did some benchmarks (notably from here, because of how misinformed the answers online are).

reflective invocation (without setAccessible)   182.837 ns
reflective invocation (with setAccessible)        1.757 ns
reflectASM invocation                             0.019 ns
methodhandle invocation                           6.391 ns
static final methodhandle invocation              0.019 ns
direct invocation                                 0.019 ns
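The "static final methodhandle" row is the interesting one: the JIT only treats a MethodHandle as a true constant (and inlines invokeExact down to direct-call speed) when it lives in a static final field. A sketch, with a made-up listener class:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class HandleBench {
    static class Listener {
        void onMessage(String msg) { /* handle the message */ }
    }

    // Only a *static final* MethodHandle is constant-folded by the JIT,
    // which is what lets invokeExact inline down to direct-call speed.
    private static final MethodHandle ON_MESSAGE;
    static {
        try {
            ON_MESSAGE = MethodHandles.lookup().findVirtual(
                    Listener.class, "onMessage",
                    MethodType.methodType(void.class, String.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) throws Throwable {
        Listener listener = new Listener();
        ON_MESSAGE.invokeExact(listener, "hello"); // exact signature: (Listener, String) -> void
    }
}
```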


bennidi commented Feb 20, 2015

Looks like another tweak that can easily be integrated. The good news is that it doesn't require any external library like ReflectASM. The only limitation is that it is available only in Java >= 7.


dorkbox commented Feb 20, 2015

Maybe -- the MethodHandle has to be "static final", which makes it tricky. I think ReflectASM is the only way to get it down that low; however, 1.7 ns is really, really fast. I'm currently working on some data structures to get object creation down too -- there's a lot going on.

Here's the latest (but not backported yet) -- the timing isn't quite what I want yet; what do you think? It turns out the memory usage isn't quite accurate (it's also measuring the test framework), so I've done some profiling in VisualVM to make improvements.

[chart]


dorkbox commented Feb 20, 2015

Here's with subscription & publication running concurrently.

1.5x - 2x as fast for read operations:
256 ns/op vs. 169 ns/op

[chart]


dorkbox commented Feb 20, 2015

Slightly tuned:

~138/117 ns/op

[chart: 4]


bennidi commented Feb 22, 2015

Looks like you are making good progress. How do we go about bringing this back into the core of MBassador? I am currently working on some minor tickets, mainly about configuration, error handling, and such. I am also thinking about changing to LTQ and MethodHandle. Have you tweaked other parts?


dorkbox commented Feb 22, 2015

I've done some more tweaks (faster iteration over collections), and I'm working on managing memory more efficiently (caching subscriptions for superclasses).

The only way MethodHandle helps is if it's "static final" (which is what lets the JIT inline the method call), and that is impossible for dynamic method invocation. Reflection via "Method" (as it currently is) or ReflectASM are the only performant options, from what I can tell.

Also, I can make the change to LTQ (I've already got the source included, so it'll work on Java 6) if you'd like.

I'll put in some pull requests in a few days -- I'm finishing up some memory issues right now.


bennidi commented Feb 22, 2015

Cool. Looking forward to that code. Can you please try and make multiple small PRs? It will help me review the code and understand the changes and their implications. Thanks!


dorkbox commented Feb 22, 2015

I'll have it in a lot of commits so you can follow it. Unfortunately, the PR is for the repo (not a commit), so there can't be different PRs for each commit (you just get the whole thing). :/


bennidi commented Feb 22, 2015

The PR is for a specific branch. You could make intermediate branches by cherry-picking your commits from master, such that each branch contains a closed and working optimization. Otherwise it will be really hard for me. I cannot take a whole bunch of changes at once; especially, the API needs to stay the same. Please try to sort out your work into meaningful chunks.

Also, the code is quite stable now. There have been no issues with the core functionality for about a year. I really want this to continue.



dorkbox commented Feb 22, 2015

Of course! I'll put it into separate branches, no problem at all. I agree, keeping it stable is very important.

@bwzhang2011

@dorkbox, how is this going? We're looking forward to the performance boost for MBassador -- whether via the queue or the Disruptor approach -- and to some more comparison tests against Guava or RRiBbit on the event bus front.


dorkbox commented Mar 8, 2015

Been super busy with work -- but I've been working on ways to improve the queue/executor: a zero-GC, good-performance executor. A cross between LTQ, the Disruptor, and Exchanger. Concurrency is really, really hard, but I'm making solid progress.


bennidi commented Mar 9, 2015

@dorkbox Sounds exciting! I am also quite busy with work currently, so I will not have much time to spend on MBassador in the coming weeks. But I will have a look at your work as soon as you are done, that's for sure. Your participation is greatly appreciated.

@bwzhang2011

@dorkbox, how is the progress on your LTQ/Disruptor/Exchanger cross going?

@bwzhang2011

@dorkbox, any update on this issue, or any idea when your work will be merged into an MBassador branch?


dorkbox commented Apr 2, 2015

Sorry for taking so long -- during my local tests/improvements I've found some other areas to improve RAM/performance (they deal with iterating over certain collections). I'll be adding (and discussing) those back in the main project.

I'm still testing the executor, and I'll post it as soon as I finish. I'm doing computer-science-masters-level work (there are surprisingly few, but really good, papers on this topic) and it's really hard.


dorkbox commented May 4, 2015

After many months of research and late nights, I've finished the blocking queue -- its heap allocation is constant (it does not change during runtime), so it has zero GC; it also scales rather well. The brains of the algorithm come from the EXCELLENT (and ridiculously fast) MPMC queue written by Nitsan. I rewrote his MPMC queue to also be a blocking queue (similar to how the LinkedTransferQueue operates). I called it the MpmcTransferArrayQueue (MTAQ), as it is based on the MpmcArrayQueue.
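To be clear, this is NOT the MTAQ code -- just the general shape of making a fixed-capacity MPMC queue block (the real thing parks/unparks waiters and can exchange items directly, rather than spinning):

```java
import java.util.concurrent.locks.LockSupport;
import org.jctools.queues.MpmcArrayQueue;

public class BlockingMpmc<T> {
    private final MpmcArrayQueue<T> q;

    public BlockingMpmc(int capacity) {
        this.q = new MpmcArrayQueue<T>(capacity); // all storage allocated up front
    }

    public void put(T item) {
        while (!q.offer(item)) {            // full: back off until a consumer frees a slot
            LockSupport.parkNanos(1L);
        }
    }

    public T take() {
        T item;
        while ((item = q.poll()) == null) { // empty: back off until a producer offers
            LockSupport.parkNanos(1L);
        }
        return item;
    }
}
```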

I'm still cleaning up/attributing the code, and I'll have it available on GitHub (in my own project), as well as part of MBassador (if approved/wanted).

The performance and memory benefit on my very simplified and structurally changed fork of MBassador was noticeable. For MBassador (master), the effects were smaller: there is a slight performance improvement, but the more noticeable improvement is in memory consumption. There are also different ways for MBassador to handle/dispatch messages, so results could vary quite a bit.

The following is a performance breakdown comparing LinkedTransferQueue (the latest version is generally accepted as one of the fastest blocking queues in the Java collections space) and MTAQ, each running in different modes with 2 consumers/2 producers or 4 consumers/4 producers.

I did not benchmark other configurations, since single-consumer/single-producer queues would use an entirely different data structure, and most consumer-grade hardware will have fewer than 8 cores.

The following tests were run on an i7-4700HQ CPU @ 2.40GHz with 16 GB of RAM, running Linux 3.13.0-49-generic, x86_64.

Mode (Threads)              LBQ          LTQ          MTAQ
Blocking (2x2 threads)      1.6M op/s    2.4M op/s    2.8M op/s
Blocking (4x4 threads)      1.1M op/s    1.4M op/s    2.8M op/s
Non-blocking (2x2 threads)  1.8M op/s    3.4M op/s    3.2M op/s
Non-blocking (4x4 threads)  0.9M op/s    1.8M op/s    7.8M op/s

Given the synchrony of MBassador, I didn't notice an improvement in performance (the dispatch times are only slightly closer to each other, and the difference is likely statistically insignificant), but less RAM is used since the queue no longer allocates on the heap per message.

LinkedBlockingQueue (current master, not LTQ)
[chart]

MTAQ
[chart]

(edit: added LinkedBlockingQueue to performance chart)


dorkbox commented May 5, 2015

For reference, here is a semi-final performance graph in use by my fork of MBassador.

[chart]


bennidi commented May 6, 2015

Wow. You have been making quite some progress on this topic. Did I understand correctly that you are doing this work in the context of university studies? I think you are doing a really good job here. What are the dependencies of the MTAQ code? Do you use Java classes from JDK 7, or is it Java 6 compatible? If not, I think it would be nice to provide it as an extension to the core. It would be too good to have your work become part of the MBassador project. And can you link the papers you mentioned?

@bwzhang2011

@dorkbox, any update on this issue?

@CodeMason

Yes. Going to be submitting code later today.

@bwzhang2011

@dorkbox, any update on this issue?


dorkbox commented Sep 9, 2015

@bwzhang2011

@dorkbox, thanks for following up on this issue and making further improvements to MBassador.


dorkbox commented Sep 13, 2015

There are a few more slight improvements coming (once I finish the MTAQ review for JCTools), so I'm leaving this issue open until that is complete.

@bwzhang2011

Thanks for the review. Now that a new MBassador release has brought in bennidi/mbassador#125, I'm looking forward to your MTAQ modification. On another front, I want to integrate the Axon command bus dispatch with MBassador, and I think MBassador could gain huge performance once MTAQ is merged.

@bwzhang2011

@dorkbox, any update on this issue?


dorkbox commented Oct 3, 2015

I'm busy finishing a fork/fix/rewrite of the Universal TweenEngine (https://github.com/dorkbox/TweenEngine) -- it had some nasty bugs coupled with a really complicated state machine, and I don't want to context switch until it's done (which hopefully is soon). It's tricky, especially concerning things like GC, reducing memory usage, and trimming unnecessary calculations.


dorkbox commented Oct 29, 2015

@bwzhang2011 I have updated the pull request with JCTools, and once it's merged, I will finish the implementation of MTAQ for MBassador.

@bwzhang2011

@dorkbox, thanks a lot for your great efforts on MBassador's performance. I will continue to use it in my project. What's more, it would be good to take distribution into consideration; maybe in the future @bennidi could make that a direction for the project. Since JCTools brings in some IPC capability, maybe MBassador could develop another project around that.

@roeltje25

@dorkbox
It's been some time since you mentioned finishing the MTAQ for MBassador. What is the status? I am also concerned about continuing progress on MBassador, now that @bennidi has mentioned he will not continue to actively develop it.

Anyway, I am interested in seeing the performance of MBassador increase, as it's the core of our development here. A big thanks to @bennidi so far.


dorkbox commented Dec 6, 2015

@roeltje25
Waiting for my pull request to be merged. Life got in the way (for both nitsanw and myself), and hopefully my changes will get merged soon.

I wouldn't be concerned about MBassador, as it is WAY better than other known message bus implementations. It's stable and incredibly simple (as these things go), which is why @bennidi doesn't need to actively develop it.

I should also mention: MBassador is very fast on its own, and the only technique I could discover to dramatically improve its performance was to strip out a lot of features, almost entirely to do with object creation. The performance improvement that MTAQ brings is that it is based on insanely fast queues (from JCTools) and removes object creation. The backbone of (pretty much all) thread executors is a queue, and MTAQ addresses this.
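Concretely, this is the seam where such a queue plugs in: ThreadPoolExecutor accepts any BlockingQueue as its work queue (LTQ shown here, since MTAQ isn't published yet):

```java
import java.util.concurrent.LinkedTransferQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ExecutorSeam {
    public static void main(String[] args) {
        // Every submitted task passes through the work queue, so swapping in a
        // faster BlockingQueue speeds up the producer-to-worker handoff.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4,                            // fixed pool of 4 workers
                0L, TimeUnit.MILLISECONDS,
                new LinkedTransferQueue<Runnable>());

        pool.execute(new Runnable() {
            public void run() { System.out.println("dispatched"); }
        });
        pool.shutdown();
    }
}
```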

I have a fork that I'm using internally, which is a stripped-down version of MBassador, but I wouldn't use it yet -- I'm waiting on the final version of JCTools and then another round of regression testing before recommending it in production systems.


nitsanw commented Dec 7, 2015

@dorkbox @roeltje25 Indeed, I have been slow to review the PR :-( I will make it a priority.

@bwzhang2011

@nitsanw, any update on this issue, especially the PR from @dorkbox?


nitsanw commented Jan 25, 2016

@bwzhang2011 I have reviewed the PR; see the dialogue here: JCTools/JCTools#68
The bottom line is, I have made some corrections/suggestions to the original impl, which @dorkbox accepted. I cannot accept the PR at this time, as it does not fully implement TransferQueue, but within its limitations the implementation is correct and beneficial. If the limited functionality it offers is sufficient for your needs, then I would suggest you start with that.
@dorkbox please correct me if I'm off the mark.
Thanks


dorkbox commented Jan 25, 2016

@nitsanw is correct, and I'm currently investigating what to do for the purposes of mbassador.

I should have an update on this issue soon, as my work schedule permits.

@bwzhang2011

@dorkbox, I especially appreciate your attitude, your great work, and your experimentation on this data structure, and I hope your implementation can be integrated with MBassador regardless of whether it is fully accepted into JCTools.


dorkbox commented Mar 12, 2016

The architecture of MBassador just won't work with some of the enhancements that I have identified. As a result, I have forked MBassador (and changed some functionality) to accomplish this. It's a much simpler version of MBassador, but it is a bit faster as a result of fewer features along with my enhancements. I would say it is more of a "pure" pub/sub message bus, with no frills (sorting, priority, filters, etc.); it's just subscribe and publish, with high-performance async publication.

On that topic, and a rather important note: using the Disruptor for async publication is significantly faster than anything else. Synchronous publication is only a little bit faster, and outside of benchmarks I don't think it would be noticeable.

All is not lost, as I will be issuing a PR adding a limited, but appropriate, set of enhancements to MBassador. Just to make it extremely clear... MBassador is already really fast, and there's just not a whole lot left to make faster. Specifically, this PR will implement the single-writer principle outlined by Nitsan Wakart.

I will attach benchmarks in the next two posts.


dorkbox commented Mar 12, 2016

This is synchronous publication. The first is MBassador, the second is my fork.
[chart: MBassador]

[chart: MessageBus]


dorkbox commented Mar 12, 2016

This is asynchronous publication. The first is MBassador, the second is my fork.

[chart: mbassador-async]

[chart: messagebus-async]


dorkbox commented Mar 12, 2016

For more detailed and extensive tests, see my Benchmarks project.

These tests aren't meant to describe "real world" performance, but to draw comparisons between different implementations; because of something called OSR (on-stack replacement), you CANNOT depend on these tests to describe what they are testing in the "real world".

dorkbox closed this as completed Mar 12, 2016

nitsanw commented Mar 12, 2016

@dorkbox Fair attribution: the single-writer principle is a @mjpt777 term I have used, but did not originate. See the blog post here: http://mechanical-sympathy.blogspot.co.za/2011/09/single-writer-principle.html


bennidi commented Mar 13, 2016

@dorkbox I see that you managed to gain around 25% in performance for synchronous dispatch (MBassador ~250 ms and your fork ~180 ms for two million handlers). That is impressive. I would not have expected this margin to be available :) But as you said, you had to remove some features that I consider an important part of the library. Anyway, it's great to see that your efforts were rewarded. I will gladly include a reference to your project in MBassador's main readme.

As for the graphs about async dispatch, I think I am not able to interpret them correctly, as they look completely different. What do you get out of them? I would also be very happy to hear a brief summary of your learnings with the Disruptor. If I remember correctly, you were struggling in the beginning to make it work. What were you doing wrong?

Thanks again for all your work.


dorkbox commented Mar 13, 2016

@bennidi You're very welcome, and thank you for the mention. This has been an interesting journey, and I have learned an incredible amount about concurrent programming -- what works and what doesn't. Also, I'm really happy with my results, and it feels great to claim <100 ns per message dispatched.

The major performance contribution was applying the single-writer principle (I recommend reading this blog post for details: http://mechanical-sympathy.blogspot.co.za/2011/09/single-writer-principle.html).

The other optimizations (in order of how much they enhanced performance) were: to remove as many branch conditions as possible -- which had the side effect of removing a bunch of features; to use faster collections (the Kryo IdentityMap instead of HashMap); and to modify the use of your ConcurrentSet iterators (Strong and Weak) to not generate objects (here and here). Removing object generation was a goal of mine, but I'm not convinced that change had much of an impact on performance... if the charts were a straight line it would be easier to measure.

You'll see that anywhere there is concurrent access to a collection, the single-writer principle is applied -- and in some areas, it just wouldn't be easy to apply to MBassador without changing the architecture. I think the main area where it could be applied (and this would be what my pull request would do) is to replace the re-entrant locks in the SubscriptionManager and to swap the HashMap for an IdentityMap.
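To make the principle concrete, here's a toy version of the pattern (illustrative names, not my actual code): all writes are funneled through one thread, so the map itself needs no locks, and readers get lock-free access via a volatile snapshot.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.LinkedTransferQueue;

public class SingleWriterRegistry {
    // Readers only ever see complete snapshots through this volatile reference.
    private volatile Map<Class<?>, Object> readView = new HashMap<Class<?>, Object>();
    private final LinkedTransferQueue<Runnable> writeOps = new LinkedTransferQueue<Runnable>();
    private final Map<Class<?>, Object> writeMap = new HashMap<Class<?>, Object>(); // writer thread only

    public SingleWriterRegistry() {
        Thread writer = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        writeOps.take().run();                              // single writer: no lock on writeMap
                        readView = new HashMap<Class<?>, Object>(writeMap); // publish a fresh snapshot
                    }
                } catch (InterruptedException expected) { /* shut down */ }
            }
        }, "subscription-writer");
        writer.setDaemon(true);
        writer.start();
    }

    public void subscribe(final Class<?> type, final Object handler) {
        writeOps.add(new Runnable() {
            public void run() { writeMap.put(type, handler); }              // mutation runs on the writer
        });
    }

    public Object lookup(Class<?> type) {
        return readView.get(type);                                          // lock-free read
    }
}
```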

RE: Async dispatch

Those graphs are a bit tricky for me to interpret as well. The best explanation I can come up with for the differences: since MBassador queues all of the publications (which is what originally started me down this "rabbit hole"), the tests finish running with very little actually getting published. The difference is that the Disruptor is fast enough to "stay on top" of processing the publications.

RE: The LMAX-Disruptor.

Yes -- I was really, really struggling with the Disruptor. This was a year ago, so my memory is a little fuzzy on the details; I'll explain how I got it to work in round 2. I don't know the exact problem a year ago, but I tried to modify the example and use it "all at once", and it failed to perform; then again on my first "retry" a few months ago, I modified it all at once -- and it failed to perform. My success came when I took a performance test example and very slowly adapted it to my own use (benchmarking after each modification). The best I can say is that the Disruptor is extremely sensitive to all of its parameters, and any changes made that differ from the examples have to be benchmarked to make sure those changes didn't break anything in the process.


nitsanw commented Mar 13, 2016

@dorkbox would a single writer Identity/Hash Map/Set help here?


dorkbox commented Mar 13, 2016

@nitsanw Currently the put() and get() are wrapped to use the single-writer-principle -- is that what you mean?


cklsoft commented Mar 27, 2016

Great job.


nitsanw commented Mar 27, 2016

@dorkbox I mean a single thread updates the map/set. Multi/single reader makes no odds in the map/set case.
