Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

InternetAddress parsing is unacceptably slow #126

Closed
sigasigasiga opened this issue Aug 16, 2022 · 27 comments
Closed

InternetAddress parsing is unacceptably slow #126

sigasigasiga opened this issue Aug 16, 2022 · 27 comments
Labels
performance Improvements to speed or memory consumption

Comments

@sigasigasiga
Copy link
Contributor

sigasigasiga commented Aug 16, 2022

As subject says, InternetAddress parsing is very slow.

If I try to parse an email with like 30000 recipients, it takes about 800 seconds to finish on my machine (i7-3770).

I took a look at what Callgrind says:

 14,947,395  /build/glibc-SzIz7B/glibc-2.31/malloc/malloc.c:_int_free [/usr/lib/x86_64-linux-gnu/libc-2.31.so]
  7,700,026  /build/glibc-SzIz7B/glibc-2.31/malloc/malloc.c:malloc [/usr/lib/x86_64-linux-gnu/libc-2.31.so]
  6,476,923  /build/glibc-SzIz7B/glibc-2.31/malloc/malloc.c:_int_malloc [/usr/lib/x86_64-linux-gnu/libc-2.31.so]
  4,720,635  ???:addrspec_parse [/opt/drweb.com/lib/x86_64-linux-gnu/libgmime-3.0.so.0.206.1]
  4,436,250  /build/glibc-SzIz7B/glibc-2.31/elf/dl-lookup.c:_dl_lookup_symbol_x [/usr/lib/x86_64-linux-gnu/ld-2.31.so]
  4,310,129  /build/glibc-SzIz7B/glibc-2.31/malloc/malloc.c:free [/usr/lib/x86_64-linux-gnu/libc-2.31.so]
  3,952,572  ???:g_slice_alloc [/opt/drweb.com/lib/x86_64-linux-gnu/libgmime-3.0.so.0.206.1]

And we can clearly see that all the slowness comes from addrspec_parse function, where GString is made for every recipient, and then it does reallocations many many times.

Currently I can think of only 2 ways to somehow fix this:

  1. Write some sort of StringBuilder and use it to create resulting address in order to reduce reallocation time
  2. Parse all addresses lazily, only when they are actually requested
@sigasigasiga
Copy link
Contributor Author

Whenever a solution is chosen, I can try implement it in a library

@jstedfast
Copy link
Owner

GString is essentially a StringBuilder (if you are referring to the .NET StringBuilder class).

@jstedfast
Copy link
Owner

g_string_new ("") calls g_string_sized_new(2), but g_string_sized_new() uses MAX(dfl_size, 64), so it should allocate with an initial buffer of 64 bytes.

https://github.com/GNOME/glib/blob/main/glib/gstring.c#L108

That should be plenty for your addrspec (which is 24 characters), so it shouldn't need to resize.

@jstedfast
Copy link
Owner

It might be possible to refactor the code to re-use the same GString over and over for building the addrspecs (just use g_string_truncate(str, 0) at the start of addrspec_parse) and then just g_malloc()/g_strlcpy() the buffer into a new char* string.

@sigasigasiga
Copy link
Contributor Author

It might be possible to refactor the code to re-use the same GString over and over for building the addrspecs (just use g_string_truncate(str, 0) at the start of addrspec_parse) and then just g_malloc()/g_strlcpy() the buffer into a new char* string.

That's an interesting idea!
Unfortunately, it doesn't seem to give a lot of improvement. After applying this patch with quick-and-dirty implementation of your idea, I got following results:

bychin ~/d/master (master)
> time ./bin/test ~/debug/10k.eml # original implementation

________________________________________________________
Executed in   42,81 secs    fish           external
   usr time   42,80 secs  597,00 micros   42,80 secs
   sys time    0,00 secs   70,00 micros    0,00 secs

bychin ~/d/master (master)
> time ./bin/test ~/debug/10k.eml # static GString

________________________________________________________
Executed in   39,71 secs    fish           external
   usr time   39,69 secs    0,00 micros   39,69 secs
   sys time    0,01 secs  819,00 micros    0,01 secs

@jstedfast
Copy link
Owner

Here's an implementation of my idea, but it seems to somehow be slower on my macOS box (arm64):

diff --git a/gmime/internet-address.c b/gmime/internet-address.c
index ae07d13f..08209973 100644
--- a/gmime/internet-address.c
+++ b/gmime/internet-address.c
@@ -98,7 +98,7 @@ enum {
 	INTERNET_ADDRESS_FOLD   = 1 << 1,
 };
 
-static gboolean addrspec_parse (const char **in, const char *sentinels, char **addrspec, int *at);
+static gboolean addrspec_parse (GString *buffer, const char **in, const char *sentinels, char **addrspec, int *at);
 
 static void internet_address_class_init (InternetAddressClass *klass);
 static void internet_address_init (InternetAddress *ia, InternetAddressClass *klass);
@@ -387,14 +387,18 @@ internet_address_mailbox_new (const char *name, const char *addr)
 {
 	InternetAddressMailbox *mailbox;
 	const char *inptr = addr;
+	GString *buffer;
 	
 	g_return_val_if_fail (addr != NULL, NULL);
 	
 	mailbox = g_object_new (INTERNET_ADDRESS_TYPE_MAILBOX, NULL);
+	buffer = g_string_sized_new (64);
 
-	if (!addrspec_parse (&inptr, "", &mailbox->addr, &mailbox->at))
+	if (!addrspec_parse (buffer, &inptr, "", &mailbox->addr, &mailbox->at))
 		mailbox->addr = g_strdup (addr);
-	
+
+	g_string_free (buffer, TRUE);
+
 	_internet_address_set_name ((InternetAddress *) mailbox, name);
 	
 	return (InternetAddress *) mailbox;
@@ -412,6 +416,7 @@ void
 internet_address_mailbox_set_addr (InternetAddressMailbox *mailbox, const char *addr)
 {
 	const char *inptr = addr;
+	GString *buffer;
 	
 	g_return_if_fail (INTERNET_ADDRESS_IS_MAILBOX (mailbox));
 	
@@ -423,9 +428,13 @@ internet_address_mailbox_set_addr (InternetAddressMailbox *mailbox, const char *
 	
 	g_free (mailbox->addr);
 	
-	if (!addrspec_parse (&inptr, "", &mailbox->addr, &mailbox->at))
+	buffer = g_string_sized_new (64);
+
+	if (!addrspec_parse (buffer, &inptr, "", &mailbox->addr, &mailbox->at))
 		mailbox->addr = g_strdup (addr);
-	
+
+	g_string_free (buffer, TRUE);
+
 	g_mime_event_emit (((InternetAddress *) mailbox)->changed, NULL);
 }
 
@@ -1632,19 +1641,26 @@ domain_parse (GString *str, const char **in, const char *sentinels)
 	return dotatom_parse (str, in, sentinels);
 }
 
+static char *
+buffer_strdup (GString *buffer)
+{
+	char *str = g_malloc (buffer->len + 1);
+	strcpy (str, buffer->str);
+	return str;
+}
+
 static gboolean
-addrspec_parse (const char **in, const char *sentinels, char **addrspec, int *at)
+addrspec_parse (GString *buffer, const char **in, const char *sentinels, char **addrspec, int *at)
 {
 	const char *inptr = *in;
-	GString *str;
-	
-	str = g_string_new ("");
-	
-	if (!localpart_parse (str, &inptr))
+
+	g_string_truncate (buffer, 0);
+
+	if (!localpart_parse (buffer, &inptr))
 		goto error;
 	
 	if (*inptr == '\0' || strchr (sentinels, *inptr)) {
-		*addrspec = g_string_free (str, FALSE);
+		*addrspec = buffer_strdup (buffer);
 		*in = inptr;
 		*at = -1;
 		return TRUE;
@@ -1653,8 +1669,8 @@ addrspec_parse (const char **in, const char *sentinels, char **addrspec, int *at
 	if (*inptr != '@')
 		goto error;
 	
-	*at = str->len;
-	g_string_append_c (str, *inptr++);
+	*at = buffer->len;
+	g_string_append_c (buffer, *inptr++);
 	
 	if (*inptr == '\0')
 		goto error;
@@ -1665,16 +1681,15 @@ addrspec_parse (const char **in, const char *sentinels, char **addrspec, int *at
 	if (*inptr == '\0')
 		goto error;
 	
-	if (!domain_parse (str, &inptr, sentinels))
+	if (!domain_parse (buffer, &inptr, sentinels))
 		goto error;
 	
-	*addrspec = g_string_free (str, FALSE);
+	*addrspec = buffer_strdup (buffer);
 	*in = inptr;
 	
 	return TRUE;
 	
  error:
-	g_string_free (str, TRUE);
 	*addrspec = NULL;
 	*in = inptr;
 	*at = -1;
@@ -1684,7 +1699,7 @@ addrspec_parse (const char **in, const char *sentinels, char **addrspec, int *at
 
 // TODO: rename to angleaddr_parse??
 static gboolean
-mailbox_parse (GMimeParserOptions *options, const char **in, const char *name, InternetAddress **address)
+mailbox_parse (GMimeParserOptions *options, GString *buffer, const char **in, const char *name, InternetAddress **address)
 {
 	GMimeRfcComplianceMode mode = g_mime_parser_options_get_address_compliance_mode (options);
 	const char *inptr = *in;
@@ -1728,7 +1743,7 @@ mailbox_parse (GMimeParserOptions *options, const char **in, const char *name, I
 	// ';' as well in case the mailbox is within a group address.
 	//
 	// Example: <[email protected], [email protected]>
-	if (!addrspec_parse (&inptr, COMMA_GREATER_THAN_OR_SEMICOLON, &addrspec, &at))
+	if (!addrspec_parse (buffer, &inptr, COMMA_GREATER_THAN_OR_SEMICOLON, &addrspec, &at))
 		goto error;
 	
 	if (!skip_cfws (&inptr))
@@ -1797,7 +1812,7 @@ group_parse (InternetAddressGroup *group, GMimeParserOptions *options, const cha
 }
 
 static gboolean
-address_parse (GMimeParserOptions *options, AddressParserFlags flags, const char **in, const char **charset, InternetAddress **address, gint64 offset)
+address_parse (GMimeParserOptions *options, AddressParserFlags flags, GString *buffer, const char **in, const char **charset, InternetAddress **address, gint64 offset)
 {
 	GMimeRfcComplianceMode mode = g_mime_parser_options_get_address_compliance_mode (options);
 	int min_words = g_mime_parser_options_get_allow_addresses_without_domain (options) ? 1 : 0;
@@ -1885,7 +1900,7 @@ address_parse (GMimeParserOptions *options, AddressParserFlags flags, const char
 		if (!(flags & ALLOW_MAILBOX))
 			goto error;
 		
-		if (!addrspec_parse (&inptr, sentinels, &addrspec, &at))
+		if (!addrspec_parse (buffer, &inptr, sentinels, &addrspec, &at))
 			goto error;
 		
 		skip_lwsp (&inptr);
@@ -1961,7 +1976,7 @@ address_parse (GMimeParserOptions *options, AddressParserFlags flags, const char
 		/* rewind back to the beginning of the local-part */
 		inptr = start;
 		
-		if (!addrspec_parse (&inptr, COMMA_GREATER_THAN_OR_SEMICOLON, &addrspec, &at))
+		if (!addrspec_parse (buffer, &inptr, COMMA_GREATER_THAN_OR_SEMICOLON, &addrspec, &at))
 			goto error;
 		
 		skip_lwsp (&inptr);
@@ -2048,7 +2063,7 @@ address_parse (GMimeParserOptions *options, AddressParserFlags flags, const char
 			name = g_strdup ("");
 		}
 		
-		retval = mailbox_parse (options, &inptr, name, address);
+		retval = mailbox_parse (options, buffer, &inptr, name, address);
 		g_free (name);
 		*in = inptr;
 		
@@ -2072,6 +2087,7 @@ address_list_parse (InternetAddressList *list, GMimeParserOptions *options, cons
 	InternetAddress *address;
 	const char *charset;
 	const char *inptr;
+	GString *buffer;
 	
 	if (!skip_cfws (in))
 		return FALSE;
@@ -2080,7 +2096,9 @@ address_list_parse (InternetAddressList *list, GMimeParserOptions *options, cons
 	
 	if (*inptr == '\0')
 		return FALSE;
-	
+
+	buffer = g_string_sized_new (64);
+
 	while (*inptr) {
 		gboolean separator_between_addrs = FALSE;
 
@@ -2089,7 +2107,7 @@ address_list_parse (InternetAddressList *list, GMimeParserOptions *options, cons
 		
 		charset = NULL;
 		
-		if (!address_parse (options, ALLOW_ANY, &inptr, &charset, &address, offset)) {
+		if (!address_parse (options, ALLOW_ANY, buffer, &inptr, &charset, &address, offset)) {
 			/* skip this address... */
 			while (*inptr && *inptr != ',' && (!is_group || *inptr != ';'))
 				inptr++;
@@ -2106,6 +2124,7 @@ address_list_parse (InternetAddressList *list, GMimeParserOptions *options, cons
 		/* Note: we loop here in case there are any null addresses between commas */
 		do {
 			if (!skip_cfws (&inptr)) {
+				g_string_free (buffer, TRUE);
 				*in = inptr;
 				
 				return FALSE;
@@ -2121,7 +2140,9 @@ address_list_parse (InternetAddressList *list, GMimeParserOptions *options, cons
 		if (can_warn && !(separator_between_addrs || (*inptr == '\0') || (is_group && *inptr == ';')))
 			_g_mime_parser_options_warn (options, offset, GMIME_WARN_INVALID_ADDRESS_LIST, *in);
 	}
-	
+
+	g_string_free (buffer, TRUE);
+
 	*in = inptr;
 	
 	return TRUE;

@jstedfast
Copy link
Owner

@sigasigasiga my patch should theoretically get better perf than yours since g_strdup() has to call strlen() on the string, then malloc(), then strcpy() whereas mine just uses the str->len (well, I renamed it to buffer, but same thing) to avoid the strlen() call.

That said, I also did not experience any improvement (in fact, it somehow got slower?).

@sigasigasiga
Copy link
Contributor Author

@jstedfast I have another idea how to reduce allocations: store addresses in a GString, where each address is separated by \0:

+---+---+---+---+---+----+---+---+---+---+---+----+
| a | @ | b | . | c | \0 | d | @ | e | . | f | \0 |
+---+---+---+---+---+----+---+---+---+---+---+----+

This way GString would become big enough pretty quickly, and thus won't reallocate that often.
Resulting GString probably could be stored inside InternetAddressList, but I'm not sure about that.

@jstedfast
Copy link
Owner

Most likely there's something more costly than the allocations

@jstedfast
Copy link
Owner

Tested a theory I had which is that it's not the parser that is slow at all, but rather the fact that there is a separate To: header for each set of 3 addresses, which means we parse hundreds of short lists of addresses and then concatenate them.

If I modify the message such that all of the addresses are in a single To: header, the performance is REALLY fast:

real	0m0.476s
user	0m0.067s
sys	0m0.051s

By comparison, this is what I get w/ your original submitted test message:

real	2m23.087s
user	2m22.083s
sys	0m0.944s

@jstedfast
Copy link
Owner

Here's the modified message that puts all of the addresses in a single To: header.

testMail2.txt

jstedfast added a commit that referenced this issue Aug 17, 2022
…ations

If a message has a ton of To: headers, for example, we were allocating a new
InternetAddressList for each To: header and concatenating them onto a single
list after parsing them.

This improves performance by ~20% for those types of cases.

Partial fix for issue #126
@jstedfast
Copy link
Owner

With the above patch applied, I now get:

real	1m54.082s
user	1m53.201s
sys	0m0.841s

@jstedfast
Copy link
Owner

I think I know how to fix the rest of the performance problem... it'll just take some time because I don't have a lot of free time to work on this.

@jstedfast
Copy link
Owner

Nevermind, wasn't that hard.

@jstedfast jstedfast added the performance Improvements to speed or memory consumption label Aug 17, 2022
@sigasigasiga
Copy link
Contributor Author

Wow, that's really impressive! Thank you!

@jstedfast
Copy link
Owner

New timings:

real	0m0.088s
user	0m0.047s
sys	0m0.024s

@jstedfast
Copy link
Owner

Oops, fixed the build (I had accidentally committed part of another local optimization idea that went unused)

@sigasigasiga
Copy link
Contributor Author

@jstedfast would you mind creating a new release with this fix in it?

@jstedfast
Copy link
Owner

released 3.2.13

@sigasigasiga
Copy link
Contributor Author

Thank you very much!

@flokli
Copy link

flokli commented Aug 19, 2022

Did this change some behaviour? Running the notmuch testsuite with 3.2.13 exposes the following failure:

NixOS/nixpkgs#187365 (comment)

@jstedfast
Copy link
Owner

It shouldn't have

@dkg
Copy link
Contributor

dkg commented Oct 4, 2022

@flokli is correct, this does break the notmuch test suite, see #129 for more detail

@dkg
Copy link
Contributor

dkg commented Oct 4, 2022

I recommend re-opening this issue, because i don't think the changes made to resolve matters addressed the underlying problem; rather, they just worked optimized for a specific sample message.

In particular, i think the problem is that the interaction in gmime-message.c beween message_update_addresses and process_header is a O(N²) operation.

I might be misunderstanding the code here, but it looks to me like as a parser passes through a message's header fields, it invokes message_header_added for each header (that is, each (name, value) pair it encounters in the message's header section), which in turn invokes process_header.

In the old code, which invoked only message_update_addresses when a specific address field was present, each time this is invoked, it clears the internal list of addresses, and then walks the entire list of headers in the message to find every matching element.

So if you have one Cc field, loading it causes a single scan of all headers, plucking out the contents of the relevant field. but if you have multiple Cc fields, each time the parser encounters a one of them, it does another full scan through all headers. So if you have N Cc fields, there will be N invocations of the message_header_addedprocess_headermessage_update_addresses chain, and each of these inner functions will in turn do a full O(N) scan of the entire header list known by the message.

So it's not surprising that a message with 1000 Cc fields might take on the order of 1000000× the compute power to parse compared to a message with only one Cc field. I'm not sure how to fix this code, because it's not clear to me what the differences are between message->addrlists[type] and the contents of message->headers. If message->addrlists[type] is supposed to be some sort of consolidated, cached version describing all of the headers of field type, then what are the cache coherency expectations from this codebase? Do we really need to re-scan all headers each time another header comes in?

@jstedfast
Copy link
Owner

jstedfast commented Oct 4, 2022

In particular, i think the problem is that the interaction in gmime-message.c beween message_update_addresses and process_header is a O(N²) operation.

With the old code, that was exactly the problem. The process_header() method called message_update_addresses() method for each and every Cc header that was added to the message headers.

The new logic checks to see if the header was appended (vs inserted), and if it was appended, it calls message_add_addresses() which only parses the new header and appends it to the existing InternetAddressList. The message_update_addresses() method is only called now if the header was inserted (because inserting new Cc addresses into the InternetAddressList in the appropriate place is impossible in that we currently do not hold enough state to know where the appropriate location to insert them would be).

Thus, the new implementation is O(N) instead of O(N²). This was done in commit 4a80ae5.

@dkg
Copy link
Contributor

dkg commented Oct 5, 2022

I understand that the "new implementation" should be O(N), but the "old implementation" is still there, right?

What's stopping it from getting triggered is the logic in header_was_appended which only triggers if the header loaded happens to be the final item in object->headers -- so this is basically trying to detect whether the code is being run from within the current implementation of g_mime_parser_construct_message, without explicitly doing that detection.

This feels a bit like it's laying a small booby trap in the code -- a reader can only understand why it's doing it this way if they see how the different pieces fit together, and it's likely to be brittle if someone makes a change (to either piece of code).

If someone tries to build a message from scratch in a weird way, for example, by building a header block that ends in Subject: but then inserting an address header just before it, they'll still run into the same O(N²) behavior. (granted, since RFC 5322, they shouldn't be inserting multiple address headers of the same name anyway, so maybe this is a "don't do that then" situation)

I don't really have a good way forward to propose concretely, so i'll understand if you just decide to leave it as-is. I just wanted to make sure i'm understanding the situation clearly. Thanks for following up about this!

@jstedfast
Copy link
Owner

What's stopping it from getting triggered is the logic in header_was_appended which only triggers if the header loaded happens to be the final item in object->headers -- so this is basically trying to detect whether the code is being run from within the current implementation of g_mime_parser_construct_message, without explicitly doing that detection.

(Mostly) True, but that particular use-case is the hardest-hit just due to the fact that the parser adds a whole slew of headers to the message in a loop (and in particular, the address headers which must be combined).

The other case this handles (hence the "Mostly" above) is developers doing g_mime_header_list_append(message->headers, "Cc", "Charles <[email protected]>", NULL) and/or g_mime_header_list_set(message->headers, "Cc", "Charles <[email protected]>", NULL) where the header being set (in this case, Cc) doesn't exist.

As you correctly point out, if they do g_mime_header_list_prepend(message->headers, ...), this starts to hit O(N²) again (for address-type headers).

That said, there probably aren't many GMime developers who are doing that. Most likely just let GMimeMessage handle serialization of addresses.

The exception to this is most likely to be prepending "Received" headers (which I know GMime is used for outside of notmuch and other user-facing email clients). Luckily, GMime will not hit any sort of O(N²) situation for Received headers.

granted, since RFC 5322, they shouldn't be inserting multiple address headers of the same name anyway, so maybe this is a "don't do that then" situation)

I actually happen to like the older RFCs for allowing multiple To/Cc/Bcc/etc headers, I think having multiple headers helps to make the raw headers look nicer than 1 address header with a huge list of folded addresses. But that's just my quirkiness, I'm sure :-)

I don't really have a good way forward to propose concretely, so i'll understand if you just decide to leave it as-is. I just wanted to make sure i'm understanding the situation clearly.

There's 2 more things we can do that I've already done in MimeKit:

  1. Instead of having the parser add headers 1 at a time to the GMimeMessage & GMimeObjects, MimeKit passes the equivalent of the full GMimeHeaderList to the new GMimeObject which can then initialize its cached header values.
  2. "Lazy-load" (or, more accurately, lazy-parse/lazy-decode) the GMimeHeader(s) only when requested via APIs like g_mime_message_get_subject(), g_mime_message_get_to(), etc.

The first optimization is what helped make MimeKit about as fast as GMime back in the early days of MimeKit development.

The second optimization is more recent (part of MimeKit v3.0 development) and made a pretty massive performance improvement in MimeKit's MimeParser when I implemented that (not really a surprise seeing as how it defers a ton of expensive parsing until later when the user actually wants to get access to those parsed addresses/etc).

That work was done as part of this optimization effort: jstedfast/MimeKit#695 (comment)

I'm sure there's a ton more optimizations we could pull from that since MimeKit and GMime are so similarly designed.

But anyway, yea... current MimeKit v3.x versions have 3 parsers:

  1. The old MimeParser.cs which was more-or-less based on GMimeParser
  2. MimeReader.cs which is more SAX-like and restructured a few things to improve performance
  3. ExperimentalMimeParser.cs which is the newer generation built on top of MimeReader.

The thread that I linked above has a bunch of benchmark results that show the gains made by the newer parsers.

Thanks for following up about this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Improvements to speed or memory consumption
Projects
None yet
Development

No branches or pull requests

4 participants