Improve MySQLnd quote escaping performance #13466

nielsdos · 2024-02-21T22:31:35Z

Commits are split for easier reviewing, and I would recommend to review them in order one by one.
I would commit them one by one separately upon approval.

For this script:

$pdo = new PDO("mysql:host=$hostname", $username, $password);
for ($i = 0; $i < 1000 * 1000; $i++) {
    $pdo->quote('this is a test with a relatively short \"string\"...');
}

Benchmark 1: ../php-src-multitasking/sapi/cli/php_old gh13440.php
  Time (mean ± σ):     313.3 ms ±   3.3 ms    [User: 311.3 ms, System: 1.4 ms]
  Range (min … max):   310.4 ms … 321.6 ms    10 runs
 
Benchmark 2: ../php-src-multitasking/sapi/cli/php gh13440.php
  Time (mean ± σ):     129.7 ms ±   2.0 ms    [User: 125.5 ms, System: 3.7 ms]
  Range (min … max):   127.5 ms … 135.7 ms    21 runs
 
Summary
  ../php-src-multitasking/sapi/cli/php gh13440.php ran
    2.42 ± 0.05 times faster than ../php-src-multitasking/sapi/cli/php_old gh13440.php

For this script:

$pdo = new PDO("mysql:host=$hostname", $username, $password);
$str = str_repeat('\xff', 30);
for ($i = 0; $i < 1000 * 1000; $i++) {
    $pdo->quote($str);
}

Benchmark 1: ../php-src-multitasking/sapi/cli/php_old gh13440.php
  Time (mean ± σ):     700.5 ms ±   5.5 ms    [User: 695.4 ms, System: 3.9 ms]
  Range (min … max):   692.6 ms … 713.9 ms    10 runs
 
Benchmark 2: ../php-src-multitasking/sapi/cli/php gh13440.php
  Time (mean ± σ):     229.4 ms ±   2.6 ms    [User: 224.1 ms, System: 4.6 ms]
  Range (min … max):   225.8 ms … 235.6 ms    12 runs
 
Summary
  ../php-src-multitasking/sapi/cli/php gh13440.php ran
    3.05 ± 0.04 times faster than ../php-src-multitasking/sapi/cli/php_old gh13440.php

Girgias · 2024-02-21T22:59:48Z

ext/mysqlnd/mysqlnd_charset.c

+		/* check unicode characters
+		 * Encodings that have a minimum length of 1 are compatible with ASCII.
+		 * So we can skip (for performance reasons) the check to mb_valid for them. */
+		if (cset->char_maxlen > 1 && (*((zend_uchar *) escapestr) > 0x80 || cset->char_minlen > 1) && (len = cset->mb_valid(escapestr, end))) {


Does it make sense to check for the UTF-8 flag on zend_strings or not?

Good idea, but we also have to check that the charset is utf8mb4. It would require breaking the public API though as mysqlnd_cset_escape_quotes does not take zend_string*. I'm not sure what impact this would have on 3rd party drivers.

So TIL that we expose mysqln driver functions. Let's keep this suggestion for a follow-up PR.

SakiTakamachi · 2024-02-22T00:15:21Z

ext/mysqlnd/mysqlnd_charset.c

+		if (cset->char_maxlen > 1 && (*((zend_uchar *) escapestr) > 0x80 || cset->char_minlen > 1)) {
+			unsigned int len = cset->mb_valid(escapestr, end);


It looked like it could be skipped in the same way in mysqlnd_cset_escape_quotes, but isn't there much of a change?

You are right, I factored out the check so I could reuse it for this function.

staabm · 2024-02-22T08:12:13Z

really cool, thank you sooo much. do you think we can/could/should do similar things for

        for ($i = 0; $i < 1000 * 1000; $i++) {
            mysqli_real_escape_string('this is a test with a relatively short \"string\"...');
        }

?

kamil-tekiela · 2024-02-22T17:11:37Z

really cool, thank you sooo much. do you think we can/could/should do similar things for
        for ($i = 0; $i < 1000 * 1000; $i++) {
            mysqli_real_escape_string('this is a test with a relatively short \"string\"...');
        }
?

Probably the PR's name is a bit misleading. If PDO is using mysqlnd then it's the same code. If PDO uses libmysql then we have no control over how the escaping function works.

nielsdos · 2024-02-22T17:25:11Z

You're right, renamed.

kamil-tekiela · 2024-02-22T18:08:42Z

@nielsdos I am trying to understand the old code as well as your changes. I have to admit I am having a hard time. The mysqlnd code is not well documented and the comments and method names are vague. However, it was all based on the libmysql code, so I am checking it to see how it's supposed to work. For example, https://github.com/mysql/mysql-server/blob/trunk/mysys/charset.cc#L439 but I still do not understand how it works. Maybe it will help you?

See also SQL injection that gets around mysql_real_escape_string() for an explanation of how this function prevents this attack.

kamil-tekiela · 2024-02-22T18:23:46Z

One thing I just cannot understand. In the function declaration it says:

static unsigned int mysqlnd_mbcharlen_gb18030(const unsigned int c)

but we are not passing an integer to it, right? How does that work?

nielsdos · 2024-02-22T20:00:41Z

For example, https://github.com/mysql/mysql-server/blob/trunk/mysys/charset.cc#L439 but I still do not understand how it works. Maybe it will help you?

So it looks like we can hit that, but we don't actually seem to test that in PHP's test suite. I read the code you linked and I think it clicked now.

I'm going to reuse their example, but I'll try to explain it with more words why the character length check is there.

Assume we are using the GBK encoding.
Assume our input is 0xbf 0x27.
Assume we don't do the character length check, so we can see what would happen.

First iteration, i.e. *from == 0xbf. This isn't a valid GBK multibyte sequence start, so the validity check here fails: https://github.com/mysql/mysql-server/blob/824e2b4064053f7daf17d7f3f84b7a3ed92e5fb4/mysys/charset.cc#L423
Without the character length check, we'd check if we need to escape the current character 0xbf. The character 0xbf isn't handled in the switch case so we don't escape it and append 0xbf to the output buffer.
Second iteration, i.e. *from == 0x27. This isn't a valid start either, so we go to the escape logic. Note that 0x27 is the character ', so we have to escape! We write two bytes to the output: \ (this is 0x5c) and ' (this is 0x27).
The function finished, let's look at the output: 0xbf 0x5c 0x27. Now we actually made a problem: 0xbf 0x5c is a valid GBK multibyte sequence! So we transformed an invalid multibyte sequences into a valid one, potentially corrupting data. The solution is to check whether it could have been part of a multibyte sequence, but the checks are less strict.

but we are not passing an integer to it, right? How does that work?

I think this is just an API oddity. Looking at the mysql code:

https://github.com/mysql/mysql-server/blob/824e2b4064053f7daf17d7f3f84b7a3ed92e5fb4/strings/ctype.cc#L972

this passes only the current byte to my_mbcharlen, which is similar what we do.

kamil-tekiela · 2024-02-22T20:10:31Z

Thank you. Your explanation actually made it clear to me. I tested right now and it works exactly how you described.

Girgias · 2024-02-23T17:00:28Z

Could you add a test for GBK example you provided? Don't think you would need a database connection proper just to initialize a PdoMySQL instance

nielsdos · 2024-02-23T20:42:19Z

Could you add a test for GBK example you provided? Don't think you would need a database connection proper just to initialize a PdoMySQL instance

Actually AFAIK we do. Anyway, test added.

kamil-tekiela · 2024-02-23T20:47:33Z

Could you rebase the unit test onto master once the tests pass, please? It should be committed regardless of this PR. Also maybe change the description of the test to reflect that it tests escaping with GBK.

nielsdos · 2024-02-23T20:49:08Z

Could you rebase the unit test onto master once the tests pass, please? It should be committed regardless of this PR. Also maybe change the description of the test to reflect that it tests escaping with GBK.

Sure, I'll do that.

kamil-tekiela · 2024-02-23T20:49:09Z

ext/pdo_mysql/tests/pdo_mysql_quote_gbk_double_escape.phpt

+?>
+--FILE--
+<?php
+    require_once __DIR__ . '/inc/mysql_pdo_test.inc';


Unrelated, but what is the agreed upon policy when it comes to indentation in test files? Did we agree on any standard?

No idea. From a quick glance it seems the pdo_mysql* tests use indented FILE sections while the bug tests don't...

Encoding name Returns != 0 when ----------------------------- -------------------------------------- check_mb_big5 0xA1 <= c <= 0xF9 check_mb_cp932 0x81 <= c <= 0x9F || 0xE0 <= c <= 0xFC check_mb_eucjpms c >= 0x80 check_mb_euckr c >= 0x80 check_mb_gb2312 0xA1 <= c <= 0xF7 check_mb_gbk 0x81 <= c <= 0xFE check_mb_sjis 0x81 <= c <= 0x9F || 0xE0 <= c <= 0xFC check_mb_ucs2 always returns length 2 check_mb_ujis c >= 0x80 check_mb_utf16 complicated check_mb_utf32 always returns length 4 check_mb_utf8_valid c >= 0x80 check_mb_utf8mb3_valid c >= 0x80 my_ismbchar_gb18030 0x81 <= c <= 0xFE The ASCII-compatible encodings, i.e. cases where the c >= 0x80 check is sufficient, have the minimum char length == 1.

Same reasoning as for the validity check.

We allocate twice the input length, and every input character results in either 1 or 2 output bytes, so we cannot overflow.

nielsdos · 2024-02-23T22:59:04Z

Test committed here: 25dbe53
Rebased and cleaned up this PR.

kamil-tekiela · 2024-02-23T23:43:33Z

Please correct me if I am wrong. The optimization assumes that if the byte is within the range 0-127 and the character set is multibyte with minimum length of 1 character, then it is safe to treat this byte as ASCII?

Since we always check the first byte of the multibyte sequence, csets such as SJIS are not a problem. https://en.wikipedia.org/wiki/Shift_JIS#Structure

The problem would only be if a cset uses one of the 128 bytes as a starting byte in a multibyte sequence. For example, ISO-2022-JP would cause a problem, but afaik MySQL doesn't support it.

nielsdos · 2024-02-24T07:57:11Z

That is indeed correct.
The commit itself has a commit description where I checked whether that holds for all encodings.

kamil-tekiela

I guess it is ok to merge. I really hope that we haven't missed anything as this is a very important function in terms of security.

nielsdos · 2024-02-24T13:29:31Z

I guess it is ok to merge. I really hope that we haven't missed anything as this is a very important function in terms of security.

The overflow check -> assert change should be fine because each byte can at most output 2 bytes, and the truncation change seems safe too.
Indeed the change to avoid the call to mb_valid is trickier. I could fuzz it for a while, i.e. write a fuzzer harness that tries to find inputs that result in a different output when given to the unoptimized vs optimized version.

staabm · 2024-03-04T14:04:23Z

thanks again for working on my issue.

The overflow check -> assert change should be fine because each byte can at most output 2 bytes, and the truncation change seems safe too.

would it make sense to merge the commits which are less risky, so some low hanging fruits are not blocked on finding time to write the fuzzer?

kamil-tekiela · 2024-03-04T14:12:06Z

@staabm I don't think the other commits would improve performance. It's only that one commit that is risky because it skips the check for character sets that should be safe.

On another note, libmysql is much faster at performing this check than mysqlnd. I don't know what they do differently, but I observed almost 2x difference.

nielsdos · 2024-03-04T16:36:15Z

I noticed that the function is very sensitive to any modification, particularly for register spills and possibly CPU performance tricks.
Removing the overflow checks was only beneficial after also guarding the valid check.
I'll look into fuzzing this as a whole sometime soon.

nielsdos · 2024-03-04T20:09:53Z

I don't see libmysqlclient doing much special in comparison to us.
I'm going to think a bit about possible compiler tricks to reduce the impact of the indirect call instead of using a potential dangerous patch. I think it's possible because the compiler generates pessimistic code because the call target "could go to anywhere" in the eyes of the compiler. We know where it may go though, and by somehow telling that to the compiler we may be able to get interprocedural optimizations and optimistic code.

By using an enum, and a switch table (which will be efficiently compiled into a jump table), we can avoid the pessimistic code generation of the indirect calls. With this I get the following runtime for the test script in phpGH-13466 on my i7-4790, which is around 1.25x faster. Time (mean ± σ): 250.9 ms ± 1.6 ms [User: 248.4 ms, System: 2.0 ms] Range (min … max): 248.9 ms … 254.4 ms 11 runs

…ibyte sequence Almost every character set can be given a number N such that a multibyte sequence starts with a byte higher than that number N. This allows us to skip a lot of work. To ensure the correctness of this, a sanity check is implemented that exhaustively tries every 4-byte sequence for every character set and checks for consistency issues. This finally gives: Time (mean ± σ): 120.2 ms ± 1.2 ms [User: 116.9 ms, System: 2.8 ms] Range (min … max): 118.0 ms … 122.9 ms 24 runs

nielsdos · 2024-03-06T21:06:39Z

I've pushed two commits:

The first one undoes the dangerous optimization; and replaces it with a purely compiler-trick-based optimization. This results in a runtime average of 250.9 ms which is 1.25x faster than the original. The idea here is to remove all the virtual calls and replace it by a jump table (due to the switch), resulting in much better code generation. Instead of storing function pointers in the charset data, I store an enum member. A dispatch function will then call the right helper based on what the enum value was.
The second one retries the dangerous optimization, but in a safer way. To make sure I introduced no bugs I have added an (optional) sanity check that exhaustively tests inputs to check for inconsistencies. The idea here is to store for each charset what the lowest number is that a multibyte sequence can start with. For typical ASCII-compatible charsets this is 0x80, and for some others it is even higher than that (i.e. 0xA1). For complex charsets, or charsets that always emit more than one byte it is 0x00 (because there's always a multibyte sequence). We use this trick to skip the valid & mbcharlen calls whereever possible. This is the biggest save of the two commits.

These two combined result in a runtime even better than the original PR had:

  Time (mean ± σ):     120.2 ms ±   1.2 ms    [User: 116.9 ms, System: 2.8 ms]
  Range (min … max):   118.0 ms … 122.9 ms    24 runs

kamil-tekiela · 2024-03-07T17:54:09Z

ext/mysqlnd/mysqlnd_charset.c

+#define ENUMERATOR_DISPATCH(x) case x##_id: return x(c);
+		ENUMERATE_ENCODINGS_CHARLEN(ENUMERATOR_DISPATCH)
+#undef ENUMERATOR_DISPATCH
+		default: return mysqlnd_mbcharlen_null(c);


Doesn't this appear twice in this switch? In the first entry in enum and in the default clause?

We also need a default return so that the function returns something on all paths (otherwise it can be undefined behaviour). As such, I just put the null case as the default as well. All of this is optimized by the compiler so the duplicate case doesn't matter performance-wise.

kamil-tekiela · 2024-03-07T18:20:10Z

ext/mysqlnd/mysqlnd_driver.c

@@ -87,6 +87,11 @@ PHPAPI void mysqlnd_library_init(void)
 		mysqlnd_register_builtin_authentication_plugins();

 		mysqlnd_reverse_api_init();
+
+#if MYSQLND_CHARSETS_SANITY_CHECK == 1


This seems fine, but I assume we have no way of incorporating this into the build process, right? It can only be run manually by the developer.

Yeah I don't think we can incorporate this, it would also be very slow to do because it does exhaustive testing.
We have something similar with mbstring where there's an optional sanity check too that the developer has to manually enable.

kamil-tekiela

Ok, I think this should be safe.

…overhead We allocate twice the input length, and every input character results in either 1 or 2 output bytes, so we cannot overflow. By using an enum, and a switch table (which will be efficiently compiled into a jump table), we can avoid the pessimistic code generation of the indirect calls. With this I get the following runtime for the test script in GH-13466 on my i7-4790, which is around 1.25x faster. Time (mean ± σ): 250.9 ms ± 1.6 ms [User: 248.4 ms, System: 2.0 ms] Range (min … max): 248.9 ms … 254.4 ms 11 runs

nielsdos · 2024-03-07T19:27:39Z

Thanks Kamil. Cleaned up with a rebase and merged.

staabm · 2024-03-07T19:31:34Z

once again - thanks for everyone involved. <3.

nielsdos requested review from kamil-tekiela and SakiTakamachi February 21, 2024 22:31

github-actions bot added Extension: mysqlnd Extension: pdo_mysql labels Feb 21, 2024

Girgias reviewed Feb 21, 2024

View reviewed changes

SakiTakamachi reviewed Feb 22, 2024

View reviewed changes

nielsdos changed the title ~~Improve PDO::quote performance~~ Improve MySQLnd quote escaping performance Feb 22, 2024

kamil-tekiela reviewed Feb 23, 2024

View reviewed changes

nielsdos added 5 commits February 23, 2024 23:58

Avoid copy

4a0e50c

Only compute char len when necessary

a3e7444

Same reasoning as for the validity check.

Replace dead overflow checks with assertions

4e96b75

We allocate twice the input length, and every input character results in either 1 or 2 output bytes, so we cannot overflow.

Apply same optimization for escape_quotes

c754be8

nielsdos force-pushed the pdo-quote-perf branch from f754350 to c754be8 Compare February 23, 2024 22:59

kamil-tekiela approved these changes Feb 24, 2024

View reviewed changes

nielsdos added 2 commits March 6, 2024 20:42

nielsdos requested a review from kamil-tekiela March 6, 2024 21:20

kamil-tekiela reviewed Mar 7, 2024

View reviewed changes

kamil-tekiela approved these changes Mar 7, 2024

View reviewed changes

nielsdos closed this in 3ddd341 Mar 7, 2024

		if (cset->char_maxlen > 1 && (((zend_uchar ) escapestr) > 0x80 \|\| cset->char_minlen > 1)) {
		unsigned int len = cset->mb_valid(escapestr, end);

Improve MySQLnd quote escaping performance #13466

Improve MySQLnd quote escaping performance #13466

Conversation

nielsdos commented Feb 21, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

staabm commented Feb 22, 2024

kamil-tekiela commented Feb 22, 2024

nielsdos commented Feb 22, 2024

kamil-tekiela commented Feb 22, 2024 • edited Loading

kamil-tekiela commented Feb 22, 2024

nielsdos commented Feb 22, 2024

kamil-tekiela commented Feb 22, 2024

Girgias commented Feb 23, 2024

nielsdos commented Feb 23, 2024

kamil-tekiela commented Feb 23, 2024

nielsdos commented Feb 23, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nielsdos commented Feb 23, 2024 • edited Loading

kamil-tekiela commented Feb 23, 2024

nielsdos commented Feb 24, 2024

kamil-tekiela left a comment

Choose a reason for hiding this comment

nielsdos commented Feb 24, 2024

staabm commented Mar 4, 2024

kamil-tekiela commented Mar 4, 2024

nielsdos commented Mar 4, 2024 • edited Loading

nielsdos commented Mar 4, 2024

nielsdos commented Mar 6, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kamil-tekiela left a comment

Choose a reason for hiding this comment

nielsdos commented Mar 7, 2024

staabm commented Mar 7, 2024

nielsdos commented Feb 21, 2024 •

edited

Loading

kamil-tekiela commented Feb 22, 2024 •

edited

Loading

nielsdos commented Feb 23, 2024 •

edited

Loading

nielsdos commented Mar 4, 2024 •

edited

Loading