Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Newline missing between headers #695

Closed
nd1012 opened this issue Aug 15, 2021 · 37 comments
Closed

Newline missing between headers #695

nd1012 opened this issue Aug 15, 2021 · 37 comments
Labels
bug Something isn't working enhancement New feature or request

Comments

@nd1012
Copy link

nd1012 commented Aug 15, 2021

Describe the bug
HeaderList.WriteTo misses writing a \r\n between two headers.

Platform

  • OS: Windows 10 latest
  • .NET Runtime: CoreCLR
  • .NET Framework: .Net Core 3.1
  • MimeKit Version: 2.14.0

To Reproduce
Code to reproduce the behavior:

ContentType contentType = ContentType.Parse("text/plain");
Header.TryParse(contentType.ToString(), out Header contentTypeHeader);
HeaderList headers = new HeaderList();
headers.Add(contentTypeHeader);
headers["Content-Encoding"] = "binary";
using (MemoryStream ms = new MemoryStream())
{
	headers.WriteTo(ms);
	Console.Write(Encoding.ASCII.GetString(ms.GetBuffer()));
}

Output:

Content-Type: text/plainContent-Encoding: binary

Expected behavior
There should be a \r\n before the Content-Encoding header starts, but it's written in one line.

When I set both headers without the HeaderList.Add method, the line break will be written as expected.

Am I missing something?

@nd1012
Copy link
Author

nd1012 commented Aug 16, 2021

ContentType contentType = ContentType.Parse("text/plain");
Header.TryParse($"{contentType}\r\n", out Header contentTypeHeader);
HeaderList headers = new HeaderList();
headers.Add(contentTypeHeader);
headers["Content-Encoding"] = "binary";
using (MemoryStream ms = new MemoryStream())
{
	headers.WriteTo(ms);
	Console.Write(Encoding.ASCII.GetString(ms.GetBuffer()));
}

When I give a string that ends with a newline to the Header.TryParse method, the header will be written having a newline, when using the HeaderList.WriteTo method.

I use this as a workaround, because I think actually the code during Header.TryParse or HeaderList.WriteTo (preferred) should be responsible for ensuring a newline to produce correct MIME headers. Please feel free to close this issue, if I'm wrong, since it's solved for me for the moment.

@jstedfast
Copy link
Owner

jstedfast commented Aug 16, 2021

What you want to use is:

contentType.ToString (Encoding.UTF8, true);

That will encode the new-line characters.

The normal ToString() method is meant only for display purposes.

@jstedfast jstedfast added the wontfix This will not be worked on label Aug 17, 2021
@jstedfast
Copy link
Owner

This would also work:

HeaderList headers = new HeaderList();
headers["Content-Type"] = "text/plain";
headers["Content-Encoding"] = "binary";

Anyway, closing this since it works the way it does because it has to serialize headers back exactly how it parsed them.

@nd1012
Copy link
Author

nd1012 commented Aug 18, 2021

Thank you, defining the encoding is of course the better usage - I didnt figure it out yet. However, even when using the encoding, the newline is still missing, if I didn't parse it with having a newline already. If this is the expected behavior, it's not a bug, but a misunderstanding.

@jstedfast
Copy link
Owner

Ugh, you are correct - sorry about that.

MimeKit internally uses an internal Encode() method and that's what I was thinking of. But the Encode() method does 2 things differently from ToString(Encoding, bool):

  1. It adds a newline to the end, like you noticed.
  2. It does not return a string that starts with "Content-Type: " like all of the variants of ToString() do.

I gave you bad advice :(

Is there are a reason you are parsing it first and then reserializing it? You could just set the raw value on the HeaderList like this:

HeaderList headers = new HeaderList();
headers["Content-Type"] = "text/plain";
headers["Content-Encoding"] = "binary";

That logic will unfold, parse, and reserialize things so that the header value is appropriately folded to meet MIME recommendations.

@jstedfast
Copy link
Owner

jstedfast commented Aug 18, 2021

FWIW, I believe that this bug report is a very fair issue to ask about.

It is 100% fair to argue that one or more of the following changes should be made:

  1. ContentType.ToString() (especially if encode = true) should add a trailing new-line. Likewise, if this does change, so should ContentDisposition.ToString() when encode = true.
  2. Header.Parse/TryParse should add the trailing new-line if it is missing. The MimeParser uses an internal Header.TryParse() method, so maybe the public APIs could enforce that trailing new-line.
  3. HeaderList.WriteTo() should enforce a new-line - at least for any header that isn't the last header. The last header is really where this should not be done since if we parse a sequence of headers where the last header isn't properly terminated, we want to be able to serialize it back out byte-for-byte exactly as we received it (if we can).

Given that, I'm going to re-open this until I've had time to think about this more.

The ContentType and ContentDisposition ToString() changes scare me a bit because of how it might break existing behavior and therefor could potentially break uses of it in real-world apps. For example, someone might currently be doing:

using (var writer = new StreamWriter (stream)) {
    // Write the MIME headers...
    writer.WriteLine (contentType.ToString (Encoding.UTF8, true));
    writer.WriteLine (contentDisposition.ToString (Encoding.UTF8, true));

    // Write the header terminator
    writer.WriteLine ();

    // Write the content
    ...
}

If I change the ToString() methods to add a terminating newline, then the above code would break.

@jstedfast jstedfast reopened this Aug 18, 2021
@jstedfast jstedfast removed the wontfix This will not be worked on label Aug 18, 2021
jstedfast added a commit that referenced this issue Aug 18, 2021
@jstedfast jstedfast added bug Something isn't working enhancement New feature or request labels Aug 18, 2021
@jstedfast
Copy link
Owner

jstedfast commented Aug 18, 2021

Ok, so I looked into option # 2 and it was a fairly simple change so I implemented it.

This fix will be in a MimeKit v2.15.0 release which I plan on making this week.

@jstedfast
Copy link
Owner

I also added (a few days ago) new ContentType/ContentDisposition .ToString(bool encode) methods so it wasn't necessary to specify Encoding.UTF8.

@nd1012
Copy link
Author

nd1012 commented Aug 18, 2021

Don't worry, these are my first steps with MimeKit, I'm still in gaming-mode ;)

Since your advertising sounds promising, I gave MimeKit (Lite) a try for parsing and building RFC conform MIME headers and bodies. I process headers and bodies separately, because MimeKit wants to manage everything in memory, what could lead to huge memory consumption (not so good on a mobile device f.e.), if the MIME entities are getting bigger. I manage large entities and bodies using a self-made stream, that automatically switches to manage the data in a temporary file, as soon as it exceeds an in-memory limit. The disadvantage is: With that I can't use parts of the high level API of MimeKit in some cases (such as MIME entity parsing, MIME part handling, etc.). But with a little try-and-error until I know MimeKit a bit better, I'm confident that I should be able to manage that using some low level API methods.

The code here is just a small piece of POC code to reproduce the issue. My thoughts when using the ContentType type are, that I can use it as builder and then simply add it to a HeaderList. I want to use the ContentType type as builder-helper to avoid pitfalls (such as encodings) when setting parameters f.e., which I would have, if I had to set the full header value as plain string, as you suggested. But since a HeaderList can only add Header objects, I had to find a way to convert the ContentType instance to a Header instance, first. That's why I finally end up in building -> encoding -> parsing -> encoding, which, of course, is a few steps too much. Now I try to find the best way to get a Header object from a ContentType object (doing the same with a ContentDisposition object is the next step in that story).

What about adding a CreateHeader method to specialized header objects like ContentType? Actually my misunderstanding was, that I can use a ContentType instance with a HeaderList, because actually it is a header. For me it wasn't clear, that ContentType != Header, which started a row of problems.

...
public Header CreateHeader (FormatOptions options, Encoding charset, bool encode)
{
	if (options == null)
		throw new ArgumentNullException (nameof (options));

	if (charset == null)
		throw new ArgumentNullException (nameof (charset));

	var value = new StringBuilder (MediaType);
	value.Append ('/');
	value.Append (MediaSubtype);

	if (encode) {
		int lineLength = value.Length;

		Parameters.Encode (options, value, ref lineLength, charset);
	} else {
		value.Append (Parameters.ToString ());
	}

	return new Header (HeaderId.ContentType, value.ToString ());
}
...

When I add the resulting Header instance to the HeaderList, the output of HeaderList.WriteTo contains a newline as expected. Just an idea to avoid breaking existing code and making it more easy for noobs like me, if it won't blow up the MimeKit code too much.

I think the most effective and non-breaking solution for version 2.x would be option 3, because the HeaderList.WriteTo method is actually producing the unexpected/unwanted result. The other options may be something for a version 3.x, where such breaking changes must be accepted.

@jstedfast
Copy link
Owner

Have you looked into the MimeParser's persistant parameter?

http://www.mimekit.net/docs/html/M_MimeKit_MimeParser__ctor_3.htm

If you need an overridable method on the MimeParser to make it possible for the MimeParser to use a custom stream implementation to store the contents of MimeParts, I can probably add a hook for that.

@nd1012
Copy link
Author

nd1012 commented Aug 18, 2021

Currently I have three limitations, when handling an incoming request MIME entity:

  1. Total entity size (f.e. 20 MB)
  2. Decoded part body size (f.e. 2 MB)
  3. In-memory-handling of a decoded part body (f.e. 80 KB)

I parse the MIME entity (multipart, but not nested) from a (not seekable) NetworkStream on the fly, picking and converting the bodies into streams for the parallel processing and responding app, that will dispose them as soon as they've been processed and are not required anymore. Typically the size of the entity or a part of it can't be predicted (may be chunked transfer encoding or devided by boundaries). My current on the fly parsing steps are:

  1. Read entity header bytes until CR LF CR LF
  2. Parse entity header using MimeKit
  3. Find the beginning of the (next) part
  4. Read part header bytes until CR LF CR LF
  5. Parse part header using MimeKit
  6. Decide, if the part is going to be used, or the body will be skipped
  7. Decode part body until the boundary into a auto-switching temporary memory/file stream
  8. Continue at 3. with the next part until the final trailing boundary of the entity body was found

Depending on the body encoding I process the NetworkStream through (chunked, base64, compression, etc.) decoding stream wrapper that target the temporary stream finally, which will be processed from the app in a separate thread then. The whole entity won't be stored, only the picked parts will be held as temporary memory/file streams until they've been processed. Picking parts is required, because the data source can't know, which parts are required for processing by the app - so it's not possible to optimize the MIME entity before sending.

MimeKit is focusing on managing an entire MIME entity, and it does it pretty good so far. I need to handle the MIME entity like a hot potato in my hands, keep the memory usage as low as possible while the performance stays as good as possible - even if the performance will go down, if a body content exceeds the in-memory handling limitation and needs to be stored in a temporary file. However, since the data source is a NetworkStream, that performance loss shouldn't be a bottle neck here. I think what I want to do is a rare and very special case, different from the common mainstream use cases like email processing.

Finally, for sending the outgoing response MIME entity to the NetworkStream, I use MimeKit for building and sending the headers, while I use encoding stream wrapper for the body, when it comes to compression and chunked encoding.

The persistent stream solution would be perfect, if the source stream wouldn't be a NetworkStream... Anyway, thanks for the tip, maybe I can use it later!

In order to be able to use the MimeParser in my special case, it would be nice, if MimeKit would implement some kind of a stream factory delegate. If the delegate get these parameters

  • Entity parsed so far (that provides access to previously parsed parts)
  • Currently parsing part (that provides access to the already parsed header list)

and returns the stream for storing the (decoded?) content, I'd be able to return a Null-stream to discard the content, or a temporary memory/file stream to store the contents:

// Somewhere in the parser code
Stream bodyStream = factoryDelegate?.Invoke (mainEntity, currentPart) ?? new MemoryStream ();

Just as an example - maybe it'd be better to place a factory option in the parser options, instead of using a constructor parameter (or use a hook, as you suggested):

MimeParser mimeParser = new MimeParser (sourceStream, persistent: false, (mainEntity, currentPart) =>
{
	// Returns the conditional stream to store the decoded body content
	if (currentPart.Headers.Contains("..."))
	{
		return new AnyStream (currentPart.Headers);
	}
	else
	{
		return Stream.Null;
	}
});

In this case the parser shouldn't throw an exception for unsupported encodings, since the used encoding may be supported from the custom stream (if not, the factory delegate should throw instead).

This could also improve email apps, when they're able to decide to use memory streams for inline parts and file streams for attachments f.e.

@jstedfast
Copy link
Owner

jstedfast commented Aug 19, 2021

Okay, so based on what I think you're saying, what you really want is a "streaming" MIME parser (or at least a "streaming" multipart parser).

It seems that there is a special need for this kind of parser in the HTTP world where you can apparently POST more-or-less endless multiparts and/or at least multiparts containing very large file streams. This apparently goes for both client-side where an application might use an HTTP GET request which results in a large multipart as well as server-side implementations of a POST request that contains large multiparts.

MimeKit's MimeParser API is suboptimal for this kind of thing even if I added a way to override which type of Stream the MimeParser creates for storing the content of each MimePart.

That said, even though MimeKit is really targeted for email applications, perhaps I could implement such a streaming parser API for this kind of thing...

In the meantime, MimeKit does provide some pretty handy tooling for what you want to do. You should definitely take a look at the MimeKit.IO.FilteredStream class as well as the MimeKit.IO.Filters.DecoderFilter class (and/or the MimeKit.Encoders namespace for direct usage of the decoders).

Next up, there are some useful techniques to be found in my MailKit project. In particular, the streaming technique I use in both the POP3 and IMAP implementations that wrap a NetworkStream (or SslStream depending on the connection type) that know when to stop reading data based on the current protocol state.

For example, in the Pop3Stream.ReadAsync method, it scans for a line containing ".\r\n" and once it reaches that, declares EOF by returning 0 and will prevent you from reading beyond that point. The ReadAsync method also unmunges (or un-byte-stuffs) the stream, meaning that it takes lines that start with ".." and removes the first dot (this is a POP3ism that is meant to safeguard the termination sequence).

The ImapStream.ReadAsync method is actually a lot simpler since it doesn't have to unmunge any data, it just knows that if it is in the "IMAP Literal String" state, that it should read until the literal is consumed and then return 0 (signifying EOF). In IMAP, there is a concept called a "literal string" which is how IMAP sends large (or non-ASCII) content. The way it does it is like this: ... TOKEN TOKEN TOKEN {12345}\r\n<12345 bytes of literal content go here> TOKEN TOKEN TOKEN (where a token that consists of { + a numeric value + } + <CR><LF> tells the parser that it should treat the next X number of bytes as a single string token).

What I would probably do is implement a "MultipartContentStream" class that did something similar and knew only to read until the next MIME boundary marker and then pretend that it had reached EOF.

// Get the ContentType of the POST request or GET request, I guess?
var contentType = postRequest.ContentType;

// Construct a new StreamingMultipartParser that uses the ContentType to know about the boundary marker.
var parser = new StreamingMultipartParser (contentType, stream);

// consume the multipart prologue (the garbage that leads up to the very first boundary marker)
parser.ConsumeMultipartPrologue ();

// the ReadNext method would construct a MimePart with the parsed headers, same as
// MimeParser.ParseHeaders() does or MimeParser.ParseMessage/Entity() does now
// except that it wouldn't load the content into a MemoryBlockStream. Instead, it would
// set the MimePart.Content.Stream to a MultipartContentStream that knows to stop reading
// once it reaches the multipart boundary marker.
while (!parser.IsEndOfStream) {
    var part = parser.ReadNext ();
    var fileName = GetFileName (part);

    // Note: up until this point, the the content of the MimePart has never been read from the NetworkStream...
    // so part.Content.Stream has vey little overhead. It doesn't even need an input buffer because it should
    // just make use of the parser's internal input buffer.
    using (var fileStream = File.Create (fileName)) {
        // The DecodeTo method will actually be reading the raw MIME content from the NetworkStream and
        // decoding it right into the file stream, in blocks of 4K at a time - so memory usage stays really low.
        part.Content.DecodeTo (fileStream);
    }
}

This could be so blazingly fast... I love it.

Question: Do these types of multiparts support nested multiparts? (because that would make this more complicated).

I would ask if message/rfc822 is supported, but I think that could easily just be mapped (in this parser) to a MimePart as well instead of a MessagePart like MimeParser uses. The problem with trying to handle a message/rfc822 part by returning a MessagePart is that the MessagePart recursively parses the embedded message instead of returning just uninterpreted content.

@jstedfast
Copy link
Owner

jstedfast commented Aug 19, 2021

@nitinag I was wondering if you might be interested in a more SAX-like MIME parser API.

I'm not sure how familiar you guys are with XML parsers (I'm not very familiar myself), but there are 2 main types of XM:L parsers:

  1. The DOM parser which parses the entire document and gives you a Document Object Model representation of the XML document (much like MimeKit's current parser returns a "MIME Document Object Model").
  2. The SAX parser which emits XML nodes as it parses them and leaves it up to you to construct a document if you want it or lets you process nodes on-the-fly.

What @nd1012 essentially wants is more like a SAX parser, but for MIME.

The OnMimeEntityBegin/End-type stuff I added to MimeParser a while back for @nitinag gives the MimeParser some of the features of a SAX parser while still keeping it a DOM parser.

I will likely need to do some brushing up on XML SAX parser APIs to see if I can figure out how they let the consumer know how the nodes are nested. I would likely need to avoid using current MimeEntity-based objects because they expect to hold onto the full list of headers/etc in order to build up the full MIME message/entity.

I sort of see an API like this:

class StreamingMimeParser
{
    // these would be called for MimeMessage *and* MimeEntity headers
    protected void OnHeaderParsed (Header header);

    // MimeMessage stuff
    protected void OnMimeMessageBegin (long offset);
    protected void OnMimeMessageEnd (long offset);

    // MimeEntity stuff
    protected void OnMimeEntityBegin (long offset);
    protected void OnMimeEntityEnd (long offset);

    // MimePart Content
    protected void OnMimeContentBegin (long offset);
    protected void OnMimeContentRead (byte[] content, int startIndex, int count);
    protected void OnMimeContentEnd (long offset);

    // Multipart boundary stuff - consumer would likely need to use these to push/pop their own MIME entity stack?
    protected void OnBoundaryPushed (string boundary);
    protected void OnBoundaryPopped (string boundary);

    // And then, of course, we have the public Parse/ParseAsync methods:
    public void Parse ();
    public Task ParseAsync ();
}

Obviously, all of the OnXyz() methods could all have corresponding events that get emitted by the default implementation so you could choose to either subclass the parser or just add event hooks, whichever way works best for you.

I spent the night tossing and turning thinking about this, trying to figure out a good way to do it for a full MimeParser-type solution rather than just the StreamingMultipartParser API that I laid out in my previous comment which was extremely simplistic.

If I was going to design a solution just for HTTP multipart-type stuff, then I would probably go with the simpler solution... but I think if I was going to implement this as an official part of MimeKit, I'd want a full parser API.

@jstedfast
Copy link
Owner

jstedfast commented Aug 19, 2021

Thought about this a lot today and took another look at the events I already have.

I also got to thinking that if I did this right, it might make it easier to re-structure the existing parser to be more linear and not recursive. Either way, I could likely make MimeParser subclass this new parser and just have a MimeEntity stack internally to construct the MimeMessage/MimeEntity tree.

Anyway... here's the hooks I know I will need after putting even more thought into this:

protected virtual void OnMboxMarkerRead (byte[] marker, long offset, int lineNumber);

protected virtual void OnHeaderRead (Header header, int beginLineNumber);

#region MimeMessage Events

protected virtual void OnMimeMessageBegin (long beginOffset, int beginLineNumber);

protected virtual void OnMimeMessageEnd (long beginOffset, int beginLineNumber, long headersEndOffset, long endOffset, int lines);

#endregion MimeMessage Events

#region MimePart Events

protected virtual void OnMimePartBegin (long beginOffset, int beginLineNumber);

protected virtual void OnMimePartContentBegin (long beginOffset, int beginLineNumber);

protected virtual void OnMimePartContentRead (byte[] content, int startIndex, int count);

protected virtual void OnMimePartContentEnd (long beginOffset, int beginLineNumber, long endOffset, int lines);

protected virtual void OnMimePartEnd (long beginOffset, int beginLineNumber, long headersEndOffset, long endOffset, int lines);

#endregion MimePart Events

#region MessagePart Events

protected virtual void OnMessagePartBegin (long beginOffset, int beginLineNumber);

protected virtual void OnMessagePartEnd (long beginOffset, int beginLineNumber, long headersEndOffset, long endOffset, int lines);

#endregion MessagePart Events

#region Multipart Events

protected virtual void OnMultipartBegin (long beginOffset, int beginLineNumber);

// called when a multipart boundary is encountered (e.g. "--my-boundary-marker\r\n")
protected virtual void OnMultipartBoundary (string boundary, long beginOffset, long endOffset, int lineNumber);

// called when an end multipart boundary is encountered (e.g. "--my-boundary-marker--\r\n")
protected virtual void OnMultipartEndBoundary (string boundary, long beginOffset, long endOffset, int lineNumber);

protected virtual void OnMultipartPreambleBegin (long beginOffset, int beginLineNumber);

protected virtual void OnMultipartPreambleRead (byte[] content, int startIndex, int count);

protected virtual void OnMultipartPreambleEnd (long beginOffset, int beginLineNumber, long endOffset, int lines);

protected virtual void OnMultipartEpilogueBegin (long beginOffset, int lineNumber);

protected virtual void OnMultipartEpilogueRead (byte[] content, int startIndex, int count);

protected virtual void OnMultipartEpilogueEnd (long beginOffset, int beginLineNumber, long endOffset, int lines);

protected virtual void OnMultipartEnd (long beginOffset, int beginLineNumber, long headersEndOffset, long endOffset, int lines);

I may change the OnMimePartContentRead/OnMultipartPreambleRead/OnMultipartEpilogueRead method names? "Read" implies that implementers should read into those buffers, but the truth is that they should write them.

Also not sure if I'll need Async versions of the "Read" (or "Write"?) methods for Async parser APIs. The rest won't need to change.

Maybe should also get passed the CancellationToken?

@nitinag
Copy link

nitinag commented Aug 20, 2021

[written prior to your most recent post - looks like you may have addressed some of it]

Had to refresh myself on the code that we have for this:

The benefit of of something like that for us would be significantly lower memory usage and performance. Right now we only pass messages to our subclassed custom MimeParser after the message has been fully read (into a file when large enough) so that any memory usage (due to messages that can be large - e.g., 100mb) is short lived. Everything is functional as is right now as well, so not a huge priority for us. Since it sounds like you're looking to design an api that covers multiple cases including ours, here are some notes:

We have our own nested tracking since we need to track all the offsets of the nested messages, parts, etc. and their headers. In our tracking, after our own internal tracking/stacks, we end up with a linear list and use the imap part specifier nested format to identify nesting, which does have some specific rules on how to order the nested parts and in special cases like nested and top part encapsulated messages.

We would still need the HeadersEndOffset since we need to store the start/end of the headers for messages/entities (to query just what we need later from the file), so maybe an OnHeadersEnd event or ensuring that OnMimeContent* runs on all types even if 0 length?

We'd also still need the raw body line count since imap requires us to store that as well at least for messages and text parts.

The only part of the parsed entity we use are:

  1. Entity type: [Message, Multipart, TextPart, BasicPart]
    We would need the part type on OnMimeEntityBegin so that we can store it and make choices on what to do based on that. TextPart has slightly different handling thus it is separately identified from BasicPart (MimePart).

  2. The fully parsed headers for messages and parts. For messages we store the full envelope information and for parts the full content headers. On your suggested API, we would build our own header list from OnHeaderParsed events and then take what we need. Though for the envelope information we need, we currently use the MimeMessage object (based on parsing just the headers).

@nitinag
Copy link

nitinag commented Aug 20, 2021

I may change the OnMimePartContentRead/OnMultipartPreambleRead/OnMultipartEpilogueRead method names? "Read" implies that implementers should read into those buffers, but the truth is that they should write them.

Read makes more sense to me since that's what the mimeparser has read and byte[] content holds what the parser has read, which is when the event is triggered. I'm hoping that if we don't use the *Read methods, no memory usage occurs in allocating the byte[] content, etc. for those unused events?

Also not sure if I'll need Async versions of the "Read" (or "Write"?) methods for Async parser APIs. The rest won't need to change.

If we were to ever use the *Read methods, one may also write the content out somewhere so aysnc may be useful to allow that use case. Async would also be useful on all the events so one could run database or network calls to check any conditions based on the entity read.

Thinking about it some more, we'd probably prefer to use ArrayPool and stream based processing when needed on the parts so we don't have to allocate sometimes 100mb memory chunks and might just use this API to get all the offsets and headers pre-processing like we do now. So, we would really avoid those *Read methods. If those *Read events are always called with full byte content, aren't we pretty much back to the entity type memory usage in some ways, maybe even more since MimePart objects point to a stream rather than byte[]? In that case, it doesn't seem that much different or useful than what is already there except that you only have one part in memory at a time, but that MimePart could still be huge?

Maybe should also get passed the CancellationToken?

Would be nice to have a way to stop parsing at any event/point.

@jstedfast
Copy link
Owner

[written prior to your most recent post - looks like you may have addressed some of it]

We would still need the HeadersEndOffset since we need to store the start/end of the headers

Yep. Gotcha covered. I'll be keeping those in the OnXyzEnd() hooks, same as they are currently in the MimeParser event API.

We'd also still need the raw body line count since imap requires us to store that as well at least for messages and text parts.

Yep, I added that to my plan as well - after remembering I had that there in the current MimeParser event API.

Entity type: [Message, Multipart, TextPart, BasicPart]
We would need the part type on OnMimeEntityBegin so that we can store it and make choices on what to do based on that. TextPart has slightly different handling thus it is separately identified from BasicPart (MimePart).

I was thinking that in this new "MimeReader"(?) (I think I like MimeReader better than StreamingMimeParser) API, I was not planning to instantiate MimeEntity objects at all, but it should be possible to use the headers to figure this out. That said... I think I'll need to track the COntentType anyway as I parse the headers so that I know which entity type I'm parsing, so maybe I can pass that into the OnMimePartBegin/MultipartBegin/MessagePartBegin APIs and maybe to the equivalent End APIs as well?

Technically, I think this is only really needed for OnMimePartBegin/End to cover the Basic vs Text differentiation. Maybe I can even add OnTextPartBegin/End instead? Not sure... will have to think on this one.

It's probably possible to just figure it out based on previously-parsed headers (or lack-thereof, implying text/plain).

The fully parsed headers for messages and parts. For messages we store the full envelope information and for parts the full content headers. On your suggested API, we would build our own header list from OnHeaderParsed events and then take what we need. Though for the envelope information we need, we currently use the MimeMessage object (based on parsing just the headers).

The MimeMessage convenience APIs like To/From/Cc/Subject/Date/etc are all doable w/o needing the overhead of MimeMessage... or, if you really want, you can just instantiate a new MimeMessage (or MimePart or whatever) and add the headers you collect from OnHeaderParsed().

None of that is currently handled by the MimeParser anyway, that's all handled by MimeMessage and MimeEntity.

Anyway, I appreciate your feedback! I'll definitely want more feedback if/when I decide to go ahead with this.

Now I just gotta convince the girlfriend to let me spend the necessary time on this ;-)

I've already been spending a lot of late nights hacking on an NTLM rewrite for MailKit and trying to figure out how to add ChannelBinding support as well (starting with the SCRAM-SHA-X-PLUS mechanisms). Then I need to figure out ChannelBinding for NTLM.

Lots of big new features in the works ;-)

@jstedfast
Copy link
Owner

jstedfast commented Aug 20, 2021

I'm hoping that if we don't use the *Read methods, no memory usage occurs in allocating the byte[] content, etc. for those unused events?

My plan was to pass along the MimeReader's internal input buffer (see MimeParser.cs for what I mean). The startIndex and count parameters would indicate where in the buffer the content is located. This would avoid allocating a new buffer.

If we were to ever use the *Read methods, one may also write the content out somewhere so aysnc may be useful to allow that use case. Async would also be useful on all the events so one could run database or network calls to check any conditions based on the entity read.

Yea, this was why I was considering having Async variants, but you raise a good point about how maybe they should all have Async variants? And have CancellationToken parameters as well.

jstedfast added a commit that referenced this issue Aug 23, 2021
The MimeReader is an alternative to the MimeParser.

Unlike the MimeParser, the MimeReader does not construct a MimeMessage
or MimeEntity "DOM". Instead, it works more like a SAX parser in that
it emits events as logical components of a MIME structure are parsed,
allowing the developer to process MIME data as it is parsed rather than
waiting for the entire message or entity to be completely constructed.

This also allows developers a way to reduce their memory overhead.

Implements the API discussed in issue #695
@jstedfast
Copy link
Owner

Ok, whipped up a new MimeReader class tonight based on this discussion and added it to a mimereader branch (based off of my vnext branch).

It's not documented yet, but I ported the MimeParser unit tests that recorded MIME offsets and it passes.

Since the MimeReader can be used to get even more detailed offsets, I'll probably want to expand on the current unit tests to test every corner of it.

A few notes:

  1. I have sync and async versions of all of the OnMimePartBegin/End virtual methods except OnHeaderRead() because I need to restructure some code to achieve that first.
  2. Does the OnXyz naming pattern really work for this? I think the OnXyz naming pattern is typically only used for methods that result in an event being emitted and these do not (altho I guess they could?). AFAICT, the pattern also seems to take EventArgs parameters instead of all of the ints/longs/CancellationTokens that mine do right now. Is there a different naming convention that I should use?
  3. Eventually I'm going to want to replace MimeParser with a new implementation built on top of MimeReader.

@jstedfast
Copy link
Owner

@nd1012 I would also love to hear any input you have. I'm hoping something like this would work better for you than having to awkwardly work around MimeParser limitations for what you are trying to accomplish.

@nitinag
Copy link

nitinag commented Aug 23, 2021

One thought that had occurred to me a few days ago is whether a OnHeaderEnd method/event may be helpful to mark the header section end? As it stands right now, if you were interested in the headers, you could implement OnHeaderRead and build your Header list, but then to know that you've gotten/finished all the headers for that section, you would have to listen to all On*Begin methods to mark the end of the prior header section. I would imagine that OnHeaderEnd would trigger even in the case of 0 length header.

headersEndOffset parameter could be moved to OnHeaderEnd, but there were edge cases where you update the headersEndOffset after reading the body, which are mentioned in the original issue and is the reason headersEndOffset wasn't just included as part of the On*Begin events originally I believe. However, wouldn't that same issue also affect the values being returned by OnHeaderRead then?

I don't know whether that is off convention or not either, but the OnXyz naming pattern is pretty clear and useful from my (an implementers) perspective while subclassing.

@jstedfast
Copy link
Owner

Yea, I think I did have to update the HeadersEndOffset - this was, I think, only for edge cases where the content of the body was empty because that new-line sequence would have to be re-counted as part of the boundary marker. I think it was also needed when the stream was truncated.

That said, if I can figure out a better way of doing it, an OnHeadersEnd() "event" would make sense to have.

@jstedfast
Copy link
Owner

@nd1012 It looks like Microsoft.AspNetCore.WebUtilities might be a good solution for you. Have you looked at it?

https://github.com/aspnet/HttpAbstractions/tree/master/src/Microsoft.AspNetCore.WebUtilities

The MultipartReaderStream is pretty much exactly what I was envisioning in #695 (comment). The MultipartReader class is also similar to what I was envisioning.

May be worth looking into (not that I don't want you to use MimeKit, but WebUtilities may be lighter weight than using MimeKitLite and more tailored to your needs).

jstedfast added a commit that referenced this issue Aug 24, 2021
The MimeReader is an alternative to the MimeParser.

Unlike the MimeParser, the MimeReader does not construct a MimeMessage
or MimeEntity "DOM". Instead, it works more like a SAX parser in that
it emits events as logical components of a MIME structure are parsed,
allowing the developer to process MIME data as it is parsed rather than
waiting for the entire message or entity to be completely constructed.

This also allows developers a way to reduce their memory overhead.

Implements the API discussed in issue #695
@jstedfast
Copy link
Owner

Rewrote the header parser tonight to allow OnHeaderReadAsync()/OnHeadersEnd/Async() and it seems a lot cleaner now, but may be slower 😥

Working on some benchmarks so I can compare. I also want to build a new MimeParser API on top of the MimeReader and compare performance there as well. Would be great to only have 1 actual parser rather than 2 😂

Current benchmarks show MimeParser parsing the JWZ mbox in less than 260ms?? And MimeReader in 178ms? Seems crazy fast!

@jstedfast
Copy link
Owner

jstedfast commented Aug 25, 2021

Looks like I misremembered exactly what test I was remembering rough stats for. My excuse is I was commenting on my phone while watching a movie with my g/f 😂

In any event, here are the raw benchmarks as run this morning.

The following 2 benchmarks were run using the old header parser in the MimeReader tests:

Method Mean Error StdDev
MimeParser_StarTrekMessage 248.000 ms 4.8560 ms 5.3974 ms
MimeParser_StarTrekMessagePersistent 215.199 ms 2.8532 ms 2.6689 ms
MimeParser_ContentLengthMbox 2.613 ms 0.0354 ms 0.0331 ms
MimeParser_ContentLengthMboxPersistent 2.437 ms 0.0440 ms 0.0411 ms
MimeParser_JwzMbox 18.187 ms 0.0874 ms 0.0817 ms
MimeParser_JwzMboxPersistent 16.429 ms 0.0807 ms 0.0674 ms
MimeReader_StarTrekMessage 178.348 ms 0.9655 ms 0.9031 ms
MimeReader_ContentLengthMbox 1.676 ms 0.0112 ms 0.0093 ms
MimeReader_JwzMbox 12.873 ms 0.1357 ms 0.1203 ms
Method Mean Error StdDev
MimeParser_StarTrekMessage 242.943 ms 2.5493 ms 2.1288 ms
MimeParser_StarTrekMessagePersistent 213.239 ms 0.5939 ms 0.5265 ms
MimeParser_ContentLengthMbox 2.523 ms 0.0134 ms 0.0119 ms
MimeParser_ContentLengthMboxPersistent 2.367 ms 0.0088 ms 0.0082 ms
MimeParser_JwzMbox 18.071 ms 0.0856 ms 0.0759 ms
MimeParser_JwzMboxPersistent 16.318 ms 0.0698 ms 0.0583 ms
MimeReader_StarTrekMessage 179.497 ms 1.2597 ms 1.0519 ms
MimeReader_ContentLengthMbox 1.675 ms 0.0123 ms 0.0115 ms
MimeReader_JwzMbox 12.559 ms 0.0670 ms 0.0626 ms

These next 2 benchmark runs were done after applying my patch that rewrote the header parser for MimeReader:

Method Mean Error StdDev
MimeParser_StarTrekMessage 246.242 ms 4.9158 ms 5.4639 ms
MimeParser_StarTrekMessagePersistent 216.666 ms 2.3338 ms 2.1830 ms
MimeParser_ContentLengthMbox 2.528 ms 0.0098 ms 0.0076 ms
MimeParser_ContentLengthMboxPersistent 2.364 ms 0.0135 ms 0.0120 ms
MimeParser_JwzMbox 17.901 ms 0.0623 ms 0.0552 ms
MimeParser_JwzMboxPersistent 16.210 ms 0.0565 ms 0.0472 ms
MimeReader_StarTrekMessage 181.140 ms 2.0074 ms 1.8778 ms
MimeReader_ContentLengthMbox 1.679 ms 0.0098 ms 0.0091 ms
MimeReader_JwzMbox 12.568 ms 0.0544 ms 0.0454 ms
Method Mean Error StdDev
MimeParser_StarTrekMessage 242.428 ms 2.3040 ms 2.1552 ms
MimeParser_StarTrekMessagePersistent 212.466 ms 1.6403 ms 1.5343 ms
MimeParser_ContentLengthMbox 2.520 ms 0.0175 ms 0.0146 ms
MimeParser_ContentLengthMboxPersistent 2.371 ms 0.0096 ms 0.0090 ms
MimeParser_JwzMbox 18.075 ms 0.0987 ms 0.0875 ms
MimeParser_JwzMboxPersistent 16.182 ms 0.0785 ms 0.0734 ms
MimeReader_StarTrekMessage 179.623 ms 0.5818 ms 0.4858 ms
MimeReader_ContentLengthMbox 1.664 ms 0.0086 ms 0.0080 ms
MimeReader_JwzMbox 12.634 ms 0.0415 ms 0.0368 ms

As we can see, the new MimeReader header parser is actually a smidgen slower but I wonder if I can speed things up a bit.

The old header parser uses this type of approach:

byte[] headerBuffer = new byte[512];

fixed (byte* inbuf = input) {
    byte* inptr = inbuf + startIndex;
    byte* inend = inbuf + inputLength;
    byte* start = inptr;

    *inend = '\n';
    while (*inptr != '\n')
        inptr++;

    if (inptr < inend) {
        // we know we have the complete line
        int length = (int) (inptr - start);

        if (length >= headerBuffer.Length)
            Array.Resize (ref headerBuffer, NextAllocSize (length));

        Buffer.BlockCopy (input, startIndex, headerBuffer, 0, length);
    } else {
        // ...
    }
}

whereas the new parser uses this approach:

byte[] headerBuffer = new byte[512];
int bufferIndex = 0;

fixed (byte* inbuf = input) {
    byte* inptr = inbuf + startIndex;
    byte* inend = inbuf + inputLength;
    byte* start = inptr;

    *inend = '\n';
    while (*inptr != '\n') {
        if (bufferIndex >= headerBuffer.Length)
            Array.Resize (headerBuffer, headerBuffer.Length + 64);

        headerBuffer[bufferIndex++] = *inptr++;
    }
}

The downside to the old approach is that if inptr == inend at the end of that loop, we would break out, fill our input buffer, and then restart the loop from the beginning of the line. This is not the case with the new approach (I know the code snippets above don't really illustrate this, but take my word for it). The new approach consumes everything in the input.

The new approach, even though slower per loop iteration, may actually compensate for that by not having to scan over the same input data more than once if the line of data crosses buffer boundaries.

In fact, the old code needs to loop over the same input data at a minimum of 2x. The first time finds the end of the line, and the second is the Buffer.BlockCopy() call. In the hopefully worst-case scenario, it would have to iterate over the input ~3x (maybe the first pass is only missing the '\n' because it hasn't been buffered yet).

One thing I want to try is a hybrid approach to see if the overhead of the Buffer.BlockCopy() is actually less than the length comparison in every loop iteration. I'm pretty sure I've seen the implementation of BlockCopy() and it blits 8/4/2/1 bytes at a time, making it very fast, so it's possible that it is faster.

I could also make the input buffer grow by 80 instead of 64 as that may be a more efficient grow size (MIME recommends wrapping lines to fit within 80 columns).

Anyway... I've got a few tricks I can try to see if I can eke out a bit more performance.

Happy that the new approach isn't drastically slower.

@jstedfast
Copy link
Owner

jstedfast commented Aug 26, 2021

Okay, so .NET 5 runtime results:

Method Mean Error StdDev
MimeParser_StarTrekMessage 230.05 ms 2.613 ms 2.316 ms
MimeParser_StarTrekMessagePersistent 205.40 ms 1.005 ms 0.890 ms
MimeParser_ContentLengthMbox 20.29 ms 0.082 ms 0.069 ms
MimeParser_ContentLengthMboxPersistent 18.99 ms 0.138 ms 0.115 ms
MimeParser_JwzMbox 164.83 ms 0.644 ms 0.603 ms
MimeParser_JwzMboxPersistent 150.66 ms 1.209 ms 1.131 ms
MimeReader_StarTrekMessage 173.72 ms 0.636 ms 0.595 ms
MimeReader_ContentLengthMbox 11.12 ms 0.082 ms 0.069 ms
MimeReader_JwzMbox 112.96 ms 0.406 ms 0.380 ms

if I enable my FUNROLL_LOOPS code section in MimeReader:

Method Mean Error StdDev
MimeParser_StarTrekMessage 234.72 ms 4.636 ms 6.028 ms
MimeParser_StarTrekMessagePersistent 205.03 ms 1.044 ms 0.872 ms
MimeParser_ContentLengthMbox 20.40 ms 0.112 ms 0.105 ms
MimeParser_ContentLengthMboxPersistent 19.00 ms 0.106 ms 0.094 ms
MimeParser_JwzMbox 164.66 ms 0.711 ms 0.630 ms
MimeParser_JwzMboxPersistent 150.02 ms 0.949 ms 0.888 ms
MimeReader_StarTrekMessage 173.12 ms 0.630 ms 0.559 ms
MimeReader_ContentLengthMbox 11.16 ms 0.084 ms 0.074 ms
MimeReader_JwzMbox 113.57 ms 0.560 ms 0.497 ms

Meh. Was hoping for a noticeable improvement, but it looks more-or-less like a nothing burger.

Just switching to .NET 5 is more of an improvement than anything I've been able to eke out with tweaks to the new header parser 😭

Note: I made the mbox parser benchmarks loop over the mbox 10x which is why the numbers are bigger than they were in the original benchmark posting.

@jstedfast
Copy link
Owner

I think I need to design a better benchmark that stresses mostly the header parser.

@jstedfast
Copy link
Owner

jstedfast commented Aug 26, 2021

Not bad... here are the results of parsing an mbox file consisting of a single looped message with lots of headers and a really short text/plain body:

Method Mean Error StdDev
MimeParser_HeaderStressTest 34.44 ms 0.180 ms 0.160 ms
MimeReader_HeaderStressTest 12.12 ms 0.107 ms 0.100 ms

For the MimeParser test, I set persistent = true to remove MemoryBlockStream memory copying for the message content.

Here are the results for the same test if I enable my FUNROLL_LOOPS code segment in StepHeaderValue() instead of using the simple while (*inptr != (byte) '\n') inptr++ loop:

Method Mean Error StdDev
MimeParser_HeaderStressTest 34.42 ms 0.120 ms 0.100 ms
MimeReader_HeaderStressTest 11.58 ms 0.084 ms 0.074 ms

Not a huge difference, but it's something...

jstedfast added a commit that referenced this issue Aug 26, 2021
The MimeReader is an alternative to the MimeParser.

Unlike the MimeParser, the MimeReader does not construct a MimeMessage
or MimeEntity "DOM". Instead, it works more like a SAX parser in that
it emits events as logical components of a MIME structure are parsed,
allowing the developer to process MIME data as it is parsed rather than
waiting for the entire message or entity to be completely constructed.

This also allows developers a way to reduce their memory overhead.

Implements the API discussed in issue #695
@jstedfast
Copy link
Owner

jstedfast commented Aug 26, 2021

Okay, here's our comparison data:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1165 (21H1/May2021Update)
Intel Core i7-9700 CPU 3.00GHz, 1 CPU, 8 logical and 8 physical cores
.NET SDK=5.0.400
[Host] : .NET 5.0.9 (5.0.921.35908), X64 RyuJIT [AttachedDebugger]
DefaultJob : .NET 5.0.9 (5.0.921.35908), X64 RyuJIT

Method Mean Error StdDev
MimeParser_HeaderStressTest 35.35 ms 0.131 ms 0.123 ms
MimeReader_HeaderStressTest1 18.25 ms 0.185 ms 0.173 ms
MimeReader_HeaderStressTest2 16.13 ms 0.085 ms 0.075 ms
MimeReader_HeaderStressTest3 12.29 ms 0.109 ms 0.102 ms
MimeReader_HeaderStressTest4 11.59 ms 0.095 ms 0.089 ms

Notes:

  • MimeReader_HeaderStressTest1 refers to the original MimeReader implementation that copied the MimeParser StepHeaders() implementation with minumal changes (if any).
  • MimeReader_HeaderStressTest2 refers to the first rewrite of StepHeaders() to allow OnHeaderRead() and OnHeaderReadAsync(). This version splots the logic into StepHeaderField() and STepHeaderValue() where each contains a loop (or loops) that iterate over the data looking for a : or \n while simultaneously copying the data into a byte[] (and having to check every loop to validate that the buffer is large enough).
  • MimeReader_HeaderStressTest3 refers to the second rewrite that uses scanning loops and then Buffer.BlockCopy() to minimize array bounds checking.
  • MimeReader_HeaderStressTest4 refers to a modification of previous iterations scanning loops to unroll them.

jstedfast added a commit that referenced this issue Aug 28, 2021
The MimeReader is an alternative to the MimeParser.

Unlike the MimeParser, the MimeReader does not construct a MimeMessage
or MimeEntity "DOM". Instead, it works more like a SAX parser in that
it emits events as logical components of a MIME structure are parsed,
allowing the developer to process MIME data as it is parsed rather than
waiting for the entire message or entity to be completely constructed.

This also allows developers a way to reduce their memory overhead.

Implements the API discussed in issue #695
jstedfast added a commit that referenced this issue Aug 29, 2021
The MimeReader is an alternative to the MimeParser.

Unlike the MimeParser, the MimeReader does not construct a MimeMessage
or MimeEntity "DOM". Instead, it works more like a SAX parser in that
it emits events as logical components of a MIME structure are parsed,
allowing the developer to process MIME data as it is parsed rather than
waiting for the entire message or entity to be completely constructed.

This also allows developers a way to reduce their memory overhead.

Implements the API discussed in issue #695
jstedfast added a commit that referenced this issue Aug 31, 2021
The MimeReader is an alternative to the MimeParser.

Unlike the MimeParser, the MimeReader does not construct a MimeMessage
or MimeEntity "DOM". Instead, it works more like a SAX parser in that
it emits events as logical components of a MIME structure are parsed,
allowing the developer to process MIME data as it is parsed rather than
waiting for the entire message or entity to be completely constructed.

This also allows developers a way to reduce their memory overhead.

Implements the API discussed in issue #695
jstedfast added a commit that referenced this issue Aug 31, 2021
The MimeReader is an alternative to the MimeParser.

Unlike the MimeParser, the MimeReader does not construct a MimeMessage
or MimeEntity "DOM". Instead, it works more like a SAX parser in that
it emits events as logical components of a MIME structure are parsed,
allowing the developer to process MIME data as it is parsed rather than
waiting for the entire message or entity to be completely constructed.

This also allows developers a way to reduce their memory overhead.

Implements the API discussed in issue #695
jstedfast added a commit that referenced this issue Sep 1, 2021
The MimeReader is an alternative to the MimeParser.

Unlike the MimeParser, the MimeReader does not construct a MimeMessage
or MimeEntity "DOM". Instead, it works more like a SAX parser in that
it emits events as logical components of a MIME structure are parsed,
allowing the developer to process MIME data as it is parsed rather than
waiting for the entire message or entity to be completely constructed.

This also allows developers a way to reduce their memory overhead.

Implements the API discussed in issue #695
jstedfast added a commit that referenced this issue Sep 1, 2021
The MimeReader is an alternative to the MimeParser.

Unlike the MimeParser, the MimeReader does not construct a MimeMessage
or MimeEntity "DOM". Instead, it works more like a SAX parser in that
it emits events as logical components of a MIME structure are parsed,
allowing the developer to process MIME data as it is parsed rather than
waiting for the entire message or entity to be completely constructed.

This also allows developers a way to reduce their memory overhead.

Implements the API discussed in issue #695
jstedfast added a commit that referenced this issue Sep 1, 2021
The MimeReader is an alternative to the MimeParser.

Unlike the MimeParser, the MimeReader does not construct a MimeMessage
or MimeEntity "DOM". Instead, it works more like a SAX parser in that
it emits events as logical components of a MIME structure are parsed,
allowing the developer to process MIME data as it is parsed rather than
waiting for the entire message or entity to be completely constructed.

This also allows developers a way to reduce their memory overhead.

Implements the API discussed in issue #695
jstedfast added a commit that referenced this issue Sep 2, 2021
The MimeReader is an alternative to the MimeParser.

Unlike the MimeParser, the MimeReader does not construct a MimeMessage
or MimeEntity "DOM". Instead, it works more like a SAX parser in that
it emits events as logical components of a MIME structure are parsed,
allowing the developer to process MIME data as it is parsed rather than
waiting for the entire message or entity to be completely constructed.

This also allows developers a way to reduce their memory overhead.

Implements the API discussed in issue #695
@jstedfast
Copy link
Owner

@nitinag FWIW, I've created a MimeKit 3.0 board to track progress of what I want to accomplish for v3.0.

jstedfast added a commit that referenced this issue Sep 6, 2021
The MimeReader is an alternative to the MimeParser.

Unlike the MimeParser, the MimeReader does not construct a MimeMessage
or MimeEntity "DOM". Instead, it works more like a SAX parser in that
it emits events as logical components of a MIME structure are parsed,
allowing the developer to process MIME data as it is parsed rather than
waiting for the entire message or entity to be completely constructed.

This also allows developers a way to reduce their memory overhead.

Implements the API discussed in issue #695
@jstedfast
Copy link
Owner

@nitinag I just finished implementing a new MimeParser on top of MimeReader...

Are you ready for the results? 😂

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1165 (21H1/May2021Update)
Intel Core i7-9700 CPU 3.00GHz, 1 CPU, 8 logical and 8 physical cores
.NET SDK=5.0.400
  [Host]     : .NET 5.0.9 (5.0.921.35908), X64 RyuJIT  [AttachedDebugger]
  DefaultJob : .NET 5.0.9 (5.0.921.35908), X64 RyuJIT


|                                             Method |       Mean |     Error |    StdDev |
|--------------------------------------------------- |-----------:|----------:|----------:|
|                         MimeParser_StarTrekMessage | 228.581 ms | 3.0492 ms | 2.8522 ms |
|               MimeParser_StarTrekMessagePersistent | 195.614 ms | 1.1403 ms | 0.9522 ms |
|                       MimeParser_ContentLengthMbox |  18.101 ms | 0.1275 ms | 0.1065 ms |
|             MimeParser_ContentLengthMboxPersistent |  16.392 ms | 0.0849 ms | 0.0753 ms |
|                                 MimeParser_JwzMbox | 172.222 ms | 1.1952 ms | 1.1180 ms |
|                       MimeParser_JwzMboxPersistent | 151.150 ms | 0.5734 ms | 0.5083 ms |
|                        MimeParser_HeaderStressTest |  26.184 ms | 0.0770 ms | 0.0683 ms |
|             ExperimentalMimeParser_StarTrekMessage | 212.104 ms | 0.7758 ms | 0.6878 ms |
|   ExperimentalMimeParser_StarTrekMessagePersistent | 183.906 ms | 0.8915 ms | 0.7903 ms |
|           ExperimentalMimeParser_ContentLengthMbox |  22.169 ms | 0.4429 ms | 1.0177 ms |
| ExperimentalMimeParser_ContentLengthMboxPersistent |  21.940 ms | 0.4388 ms | 1.1863 ms |
|                     ExperimentalMimeParser_JwzMbox | 244.491 ms | 3.1079 ms | 2.9071 ms |
|           ExperimentalMimeParser_JwzMboxPersistent | 244.751 ms | 2.3894 ms | 2.2350 ms |
|            ExperimentalMimeParser_HeaderStressTest |   1.694 ms | 0.0034 ms | 0.0028 ms |
|                         MimeReader_StarTrekMessage | 177.045 ms | 1.9746 ms | 1.8470 ms |
|                       MimeReader_ContentLengthMbox |  12.628 ms | 0.0347 ms | 0.0290 ms |
|                                 MimeReader_JwzMbox | 128.712 ms | 0.5835 ms | 0.5172 ms |
|                        MimeReader_HeaderStressTest |  11.689 ms | 0.0396 ms | 0.0351 ms |

@jstedfast
Copy link
Owner

Gotta figure out the following 2 - they are way slower for some reason.

|                                             Method |       Mean |     Error |    StdDev |
|--------------------------------------------------- |-----------:|----------:|----------:|
|                     ExperimentalMimeParser_JwzMbox | 244.491 ms | 3.1079 ms | 2.9071 ms |
|           ExperimentalMimeParser_JwzMboxPersistent | 244.751 ms | 2.3894 ms | 2.2350 ms |

And this seems too fast?

|                                             Method |       Mean |     Error |    StdDev |
|--------------------------------------------------- |-----------:|----------:|----------:|
|            ExperimentalMimeParser_HeaderStressTest |   1.694 ms | 0.0034 ms | 0.0028 ms |

@jstedfast
Copy link
Owner

Ok, there we go... ExperimentalMimeParser was always using persistent = false

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1165 (21H1/May2021Update)
Intel Core i7-9700 CPU 3.00GHz, 1 CPU, 8 logical and 8 physical cores
.NET SDK=5.0.400
  [Host]     : .NET 5.0.9 (5.0.921.35908), X64 RyuJIT  [AttachedDebugger]
  DefaultJob : .NET 5.0.9 (5.0.921.35908), X64 RyuJIT


|                                   Method |     Mean |   Error |  StdDev |
|----------------------------------------- |---------:|--------:|--------:|
|           ExperimentalMimeParser_JwzMbox | 183.6 ms | 3.22 ms | 4.72 ms |
| ExperimentalMimeParser_JwzMboxPersistent | 160.1 ms | 1.12 ms | 0.99 ms |

Darn, slower than the original MimeParser, but at least respectable. It's probably about on-par with the 2.x MimeParser (I have some perf improvements in my mimereader branch for MimeParser).

@jstedfast
Copy link
Owner

jstedfast commented Sep 8, 2021

Okay, here we are... final benchmark now that all the kinks are worked out:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1165 (21H1/May2021Update)
Intel Core i7-9700 CPU 3.00GHz, 1 CPU, 8 logical and 8 physical cores
.NET SDK=5.0.400
  [Host]     : .NET 5.0.9 (5.0.921.35908), X64 RyuJIT  [AttachedDebugger]
  DefaultJob : .NET 5.0.9 (5.0.921.35908), X64 RyuJIT


|                                             Method |      Mean |    Error |   StdDev |
|--------------------------------------------------- |----------:|---------:|---------:|
|                         MimeParser_StarTrekMessage | 232.25 ms | 1.043 ms | 0.925 ms |
|               MimeParser_StarTrekMessagePersistent | 201.63 ms | 1.930 ms | 1.711 ms |
|                       MimeParser_ContentLengthMbox |  18.85 ms | 0.138 ms | 0.123 ms |
|             MimeParser_ContentLengthMboxPersistent |  17.10 ms | 0.123 ms | 0.109 ms |
|                                 MimeParser_JwzMbox | 180.51 ms | 2.242 ms | 2.097 ms |
|                       MimeParser_JwzMboxPersistent | 157.71 ms | 1.073 ms | 0.951 ms |
|                        MimeParser_HeaderStressTest |  28.27 ms | 0.102 ms | 0.091 ms |
|             ExperimentalMimeParser_StarTrekMessage | 235.22 ms | 1.509 ms | 1.337 ms |
|   ExperimentalMimeParser_StarTrekMessagePersistent | 202.72 ms | 0.973 ms | 0.910 ms |
|           ExperimentalMimeParser_ContentLengthMbox |  18.85 ms | 0.111 ms | 0.098 ms |
| ExperimentalMimeParser_ContentLengthMboxPersistent |  17.02 ms | 0.076 ms | 0.071 ms |
|                     ExperimentalMimeParser_JwzMbox | 182.40 ms | 1.237 ms | 1.157 ms |
|           ExperimentalMimeParser_JwzMboxPersistent | 158.12 ms | 0.913 ms | 0.854 ms |
|            ExperimentalMimeParser_HeaderStressTest |  21.42 ms | 0.161 ms | 0.150 ms |
|                         MimeReader_StarTrekMessage | 181.54 ms | 2.020 ms | 1.890 ms |
|                       MimeReader_ContentLengthMbox |  13.06 ms | 0.042 ms | 0.039 ms |
|                                 MimeReader_JwzMbox | 133.09 ms | 1.241 ms | 1.161 ms |
|                        MimeReader_HeaderStressTest |  11.99 ms | 0.045 ms | 0.042 ms |

Looks like we more-or-less have a tie between the MImeParser and ExperimentalMimeParser.

@jstedfast
Copy link
Owner

jstedfast commented Sep 8, 2021

The JWZ mbox file is 10.2 MB and the becnhmark parses the mbox file 100 times. If I am mathing correctly, that's about 1 GB/s throughput from my parser.

@jstedfast
Copy link
Owner

Oops, I did it again!

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1165 (21H1/May2021Update)
Intel Core i7-9700 CPU 3.00GHz, 1 CPU, 8 logical and 8 physical cores
.NET SDK=5.0.400
  [Host]     : .NET 5.0.9 (5.0.921.35908), X64 RyuJIT  [AttachedDebugger]
  DefaultJob : .NET 5.0.9 (5.0.921.35908), X64 RyuJIT

|                                             Method |      Mean |    Error |   StdDev |
|--------------------------------------------------- |----------:|---------:|---------:|
|                         MimeParser_StarTrekMessage | 232.25 ms | 1.043 ms | 0.925 ms |
|             ExperimentalMimeParser_StarTrekMessage | 223.45 ms | 1.079 ms | 1.009 ms |
|                         MimeReader_StarTrekMessage | 176.73 ms | 0.498 ms | 0.442 ms |

|                                             Method |      Mean |    Error |   StdDev |
|--------------------------------------------------- |----------:|---------:|---------:|
|               MimeParser_StarTrekMessagePersistent | 201.63 ms | 1.930 ms | 1.711 ms |
|   ExperimentalMimeParser_StarTrekMessagePersistent | 194.39 ms | 0.847 ms | 0.792 ms |

|                                             Method |      Mean |    Error |   StdDev |
|--------------------------------------------------- |----------:|---------:|---------:|
|                       MimeParser_ContentLengthMbox |  18.85 ms | 0.138 ms | 0.123 ms |
|           ExperimentalMimeParser_ContentLengthMbox |  18.16 ms | 0.128 ms | 0.120 ms |
|                       MimeReader_ContentLengthMbox |  12.42 ms | 0.041 ms | 0.034 ms |

|                                             Method |      Mean |    Error |   StdDev |
|--------------------------------------------------- |----------:|---------:|---------:|
|             MimeParser_ContentLengthMboxPersistent |  17.10 ms | 0.123 ms | 0.109 ms |
| ExperimentalMimeParser_ContentLengthMboxPersistent |  16.31 ms | 0.115 ms | 0.102 ms |

|                                             Method |      Mean |    Error |   StdDev |
|--------------------------------------------------- |----------:|---------:|---------:|
|                                 MimeParser_JwzMbox | 180.51 ms | 2.242 ms | 2.097 ms |
|                     ExperimentalMimeParser_JwzMbox | 171.97 ms | 0.570 ms | 0.533 ms |
|                                 MimeReader_JwzMbox | 129.92 ms | 0.637 ms | 0.564 ms |

|                                             Method |      Mean |    Error |   StdDev |
|--------------------------------------------------- |----------:|---------:|---------:|
|                       MimeParser_JwzMboxPersistent | 157.71 ms | 1.073 ms | 0.951 ms |
|           ExperimentalMimeParser_JwzMboxPersistent | 150.19 ms | 0.424 ms | 0.376 ms |

|                                             Method |      Mean |    Error |   StdDev |
|--------------------------------------------------- |----------:|---------:|---------:|
|                        MimeParser_HeaderStressTest |  28.27 ms | 0.102 ms | 0.091 ms |
|            ExperimentalMimeParser_HeaderStressTest |  20.35 ms | 0.243 ms | 0.227 ms |
|                        MimeReader_HeaderStressTest |  12.40 ms | 0.127 ms | 0.119 ms |

I remembered some optimizations that I had made to MimeParser and applied those same things to MimeReader and now ExperimentalMimeParser is always faster than MimeParser.

Oh, and I should mention that MimeParser in the mimereader branch is also faster than MimeKit 2.15.0's MImeParser :)

@nitinag
Copy link

nitinag commented Sep 9, 2021

That's awesome work, looking forward to using MimeReader!

jstedfast added a commit that referenced this issue Sep 9, 2021
The MimeReader is an alternative to the MimeParser.

Unlike the MimeParser, the MimeReader does not construct a MimeMessage
or MimeEntity "DOM". Instead, it works more like a SAX parser in that
it emits events as logical components of a MIME structure are parsed,
allowing the developer to process MIME data as it is parsed rather than
waiting for the entire message or entity to be completely constructed.

This also allows developers a way to reduce their memory overhead.

Implements the API discussed in issue #695
jstedfast added a commit that referenced this issue Sep 14, 2021
The MimeReader is an alternative to the MimeParser.

Unlike the MimeParser, the MimeReader does not construct a MimeMessage
or MimeEntity "DOM". Instead, it works more like a SAX parser in that
it emits events as logical components of a MIME structure are parsed,
allowing the developer to process MIME data as it is parsed rather than
waiting for the entire message or entity to be completely constructed.

This also allows developers a way to reduce their memory overhead.

Implements the API discussed in issue #695
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants