-
-
Notifications
You must be signed in to change notification settings - Fork 373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Benchmarks - Some comments #886
Comments
Thanks :-)
sounds awesome!
Yes and no. Let me try to explain. MimeKit's MimeParser (and GMime's GMimeParser) mostly just splits header blocks into individual header field & value pairs. It does not decode the headers nor does it decode MIME content. So far you are correct. MimeKit's MimeMessage/MimeEntity/Part/etc (and GMime's GMimeMessage/GMimeObject/GMimePart/etc) would then parse/decode many of the individual headers as the *MimeParser fed those headers to those parts. GMime's GMimeMessage handled fewer of those headers than MimeKit's MimeMessage (MimeKit handled additional headers like the Resent-* headers, the Priority headers, etc that GMime never did) and is likely one of the reasons that MimeKit's parser is (was?) a little slower than GMime (I think MimeKit's parser is an incremental improvement over GMime and I have not gone back and put in the effort to bring those design improvements back to GMime, unfortunately). Since then, I have restructured MimeMessage and I think MimeEntity/etc parts to be more lazy in that it now defers parsing of the Subject/From/To/Cc/Bcc/Date/etc headers until the corresponding MimeMessage property is accessed (e.g. MimeMessage.Subject, MimeMessage.From, etc). So again, your assessment is fairly accurate. When I originally wrote the benchmarks, however, this was not the case - MimeMessage was pretty greedy (in that it would immediately decode every header that it had a corresponding property for). In fairness, even if that were still true, you wouldn't technically be wrong in that it doesn't decode every header, so you're right in that it's not an exactly apples-to-apples comparison.
It's less of a fair (apples-to-apples) comparison now than it was when I originally wrote the benchmarks, for sure. I would suggest taking a peek at MimeKit's MimeReader and the ExperimentalMimeParser (which is built on it) that I hope to replace MimeParser with once I work up the courage to make that switch. With so many people using MimeKit now and depending on its reliability, I am nervous about making that switch even though I beefed up my unit tests and made sure that the results matched in both cases. It's still nerve racking! My guess is that if you are pulling off 4.5ms even after doing full rfc2047 header decoding to UTF8 along with SPAM processing and base64/qp decoding of content(?), then that very well could be as fast or faster than MimeKit (if there was a way to do a more apples-to-apples comparison). Pretty impressive! Is this an open source library? |
Thanks @jstedfast for your gracious feedback and time to answer. The library is not open source, not because we want to close it, but that it depends on Just-JS -- see more here: https://www.techempower.com/benchmarks/#section=data-r21&test=composite (#1 framework in terms of speed). I'll separate into its own repo and share shortly. You won't be able to easily run it, but can see the code (the JS is simple, it depends quite a bit on C++ code for Base64 and QP). The Just-JS developer used the v8 engine and wrote his wrapper / event loop around it, focusing on Linux-only. So no bloating for multi-os. Lighting fast! At the expense of junior-developer friendliness. With NodeJS we wouldn't be able to be even close. However, we were very heavily inspired by this famous library: https://github.com/postalsys/postal-mime. Its core strength is a single buffer that it traverses (Vs. strings, regex etc.), with the following key changes:
|
And
|
I think I took a look at this a few years ago and it was very impressive, but it looked like (and I didn't test to verify this) it would break when fed base64 content that had line breaks (as found in MIME). How did you handle that? Or was I just wrong? I suppose you could base64 decode line by line, but my theory was that that would destroy the performance gains that it would have over MimeKit's base64 decoder (again, untested, and I should really play with it to find out).
Similar approach to MimeKit I guess, although you might have a better algorithm than MimeKit does... I've seen boyer-moore approaches where as MimeKit's parser loops line-by-line and then matches at the beginning of each line.
Yea, MimeKit has far less of this than any of the other C# libraries I've seen, but to be honest, there's still more of it than I would like. There's definitely still room for improvement in this area that I could be doing in MimeKit. |
Very crude, but works :)
Make a copy of buffer without the non-base64 chars. We learned not to try to be smart sometimes and go for back-to-basics. This way we just pass a link to the buffer with offset and len, and get back base64 decoded from C++. One line of code in JS.
In our case, we were able to do most reading in fast forward mode to the next boundary. Only headers are line by line. No performance benefits to handling any differently as we need to manually parse headers anyway. Once past the main header, all internal headers / boundaries are a few lines at the most.
This is really the strength of the algorithm. In fact, we are considering overwriting the original buffer in the future instead of mallocs (base64 decoded will always be smaller than original, so just overwrite while decoding, same for QP). |
And on the way back in (to create MIME), here is how we split base64-encoded
|
Dear @jstedfast
A quick note to say thank you for all that you have done for email parsing and handling for the community. A big debt of gratitude is owed to you.
First question:
We developed our own MIME parser in JS (not Node, our own Just/v8-based framework, so native-like) with lots of C++ bindings (Base64, QP, Textdecoding etc.). We are able to achieve similar benchmark numbers as yours. In fact, we worked hard to get to your numbers.
If we focus on Xamarin3:
I took a glance at the parser in MimeKit (and GMime). It seems to be a lazy/opportunistic parser. Wanted to make sure that we are comparing apples to apples.
Second question:
The other parsers you are comparing to seem to include a lot more work in their parser that you do later when fields are requested. Is it a fair apple-to-apple comparison here?
Thanks again! This work is a great instigator of performance improvement for the entire email ecosystem. Wildduck talks about 30 emails / s. With a good parser like yours we can hit 1000 emails / s.
The text was updated successfully, but these errors were encountered: