Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve MySQLnd quote escaping performance #13466

Closed
wants to merge 7 commits into from

Conversation

nielsdos
Copy link
Member

@nielsdos nielsdos commented Feb 21, 2024

Closes GH-13440

Commits are split for easier reviewing, and I would recommend to review them in order one by one.
I would commit them one by one separately upon approval.

For this script:

$pdo = new PDO("mysql:host=$hostname", $username, $password);
for ($i = 0; $i < 1000 * 1000; $i++) {
    $pdo->quote('this is a test with a relatively short \"string\"...');
}
Benchmark 1: ../php-src-multitasking/sapi/cli/php_old gh13440.php
  Time (mean ± σ):     313.3 ms ±   3.3 ms    [User: 311.3 ms, System: 1.4 ms]
  Range (min … max):   310.4 ms … 321.6 ms    10 runs
 
Benchmark 2: ../php-src-multitasking/sapi/cli/php gh13440.php
  Time (mean ± σ):     129.7 ms ±   2.0 ms    [User: 125.5 ms, System: 3.7 ms]
  Range (min … max):   127.5 ms … 135.7 ms    21 runs
 
Summary
  ../php-src-multitasking/sapi/cli/php gh13440.php ran
    2.42 ± 0.05 times faster than ../php-src-multitasking/sapi/cli/php_old gh13440.php

For this script:

$pdo = new PDO("mysql:host=$hostname", $username, $password);
$str = str_repeat('\xff', 30);
for ($i = 0; $i < 1000 * 1000; $i++) {
    $pdo->quote($str);
}
Benchmark 1: ../php-src-multitasking/sapi/cli/php_old gh13440.php
  Time (mean ± σ):     700.5 ms ±   5.5 ms    [User: 695.4 ms, System: 3.9 ms]
  Range (min … max):   692.6 ms … 713.9 ms    10 runs
 
Benchmark 2: ../php-src-multitasking/sapi/cli/php gh13440.php
  Time (mean ± σ):     229.4 ms ±   2.6 ms    [User: 224.1 ms, System: 4.6 ms]
  Range (min … max):   225.8 ms … 235.6 ms    12 runs
 
Summary
  ../php-src-multitasking/sapi/cli/php gh13440.php ran
    3.05 ± 0.04 times faster than ../php-src-multitasking/sapi/cli/php_old gh13440.php

/* check unicode characters
* Encodings that have a minimum length of 1 are compatible with ASCII.
* So we can skip (for performance reasons) the check to mb_valid for them. */
if (cset->char_maxlen > 1 && (*((zend_uchar *) escapestr) > 0x80 || cset->char_minlen > 1) && (len = cset->mb_valid(escapestr, end))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to check for the UTF-8 flag on zend_strings or not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, but we also have to check that the charset is utf8mb4. It would require breaking the public API though as mysqlnd_cset_escape_quotes does not take zend_string*. I'm not sure what impact this would have on 3rd party drivers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So TIL that we expose mysqln driver functions. Let's keep this suggestion for a follow-up PR.

Comment on lines 829 to 834
if (cset->char_maxlen > 1 && (*((zend_uchar *) escapestr) > 0x80 || cset->char_minlen > 1)) {
unsigned int len = cset->mb_valid(escapestr, end);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looked like it could be skipped in the same way in mysqlnd_cset_escape_quotes, but isn't there much of a change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I factored out the check so I could reuse it for this function.

@staabm
Copy link
Contributor

staabm commented Feb 22, 2024

really cool, thank you sooo much. do you think we can/could/should do similar things for

        for ($i = 0; $i < 1000 * 1000; $i++) {
            mysqli_real_escape_string('this is a test with a relatively short \"string\"...');
        }

?

@kamil-tekiela
Copy link
Member

really cool, thank you sooo much. do you think we can/could/should do similar things for

        for ($i = 0; $i < 1000 * 1000; $i++) {
            mysqli_real_escape_string('this is a test with a relatively short \"string\"...');
        }

?

Probably the PR's name is a bit misleading. If PDO is using mysqlnd then it's the same code. If PDO uses libmysql then we have no control over how the escaping function works.

@nielsdos nielsdos changed the title Improve PDO::quote performance Improve MySQLnd quote escaping performance Feb 22, 2024
@nielsdos
Copy link
Member Author

You're right, renamed.

@kamil-tekiela
Copy link
Member

kamil-tekiela commented Feb 22, 2024

@nielsdos I am trying to understand the old code as well as your changes. I have to admit I am having a hard time. The mysqlnd code is not well documented and the comments and method names are vague. However, it was all based on the libmysql code, so I am checking it to see how it's supposed to work. For example, https://github.com/mysql/mysql-server/blob/trunk/mysys/charset.cc#L439 but I still do not understand how it works. Maybe it will help you?

See also SQL injection that gets around mysql_real_escape_string() for an explanation of how this function prevents this attack.

@kamil-tekiela
Copy link
Member

One thing I just cannot understand. In the function declaration it says:

static unsigned int mysqlnd_mbcharlen_gb18030(const unsigned int c)

but we are not passing an integer to it, right? How does that work?

@nielsdos
Copy link
Member Author

For example, https://github.com/mysql/mysql-server/blob/trunk/mysys/charset.cc#L439 but I still do not understand how it works. Maybe it will help you?

So it looks like we can hit that, but we don't actually seem to test that in PHP's test suite. I read the code you linked and I think it clicked now.

I'm going to reuse their example, but I'll try to explain it with more words why the character length check is there.

Assume we are using the GBK encoding.
Assume our input is 0xbf 0x27.
Assume we don't do the character length check, so we can see what would happen.

  1. First iteration, i.e. *from == 0xbf. This isn't a valid GBK multibyte sequence start, so the validity check here fails: https://github.com/mysql/mysql-server/blob/824e2b4064053f7daf17d7f3f84b7a3ed92e5fb4/mysys/charset.cc#L423
    Without the character length check, we'd check if we need to escape the current character 0xbf. The character 0xbf isn't handled in the switch case so we don't escape it and append 0xbf to the output buffer.
  2. Second iteration, i.e. *from == 0x27. This isn't a valid start either, so we go to the escape logic. Note that 0x27 is the character ', so we have to escape! We write two bytes to the output: \ (this is 0x5c) and ' (this is 0x27).
  3. The function finished, let's look at the output: 0xbf 0x5c 0x27. Now we actually made a problem: 0xbf 0x5c is a valid GBK multibyte sequence! So we transformed an invalid multibyte sequences into a valid one, potentially corrupting data. The solution is to check whether it could have been part of a multibyte sequence, but the checks are less strict.

but we are not passing an integer to it, right? How does that work?

I think this is just an API oddity. Looking at the mysql code:

https://github.com/mysql/mysql-server/blob/824e2b4064053f7daf17d7f3f84b7a3ed92e5fb4/strings/ctype.cc#L972

this passes only the current byte to my_mbcharlen, which is similar what we do.

@kamil-tekiela
Copy link
Member

Thank you. Your explanation actually made it clear to me. I tested right now and it works exactly how you described.

@Girgias
Copy link
Member

Girgias commented Feb 23, 2024

Could you add a test for GBK example you provided? Don't think you would need a database connection proper just to initialize a PdoMySQL instance

@nielsdos
Copy link
Member Author

Could you add a test for GBK example you provided? Don't think you would need a database connection proper just to initialize a PdoMySQL instance

Actually AFAIK we do. Anyway, test added.

@kamil-tekiela
Copy link
Member

Could you rebase the unit test onto master once the tests pass, please? It should be committed regardless of this PR. Also maybe change the description of the test to reflect that it tests escaping with GBK.

@nielsdos
Copy link
Member Author

Could you rebase the unit test onto master once the tests pass, please? It should be committed regardless of this PR. Also maybe change the description of the test to reflect that it tests escaping with GBK.

Sure, I'll do that.

?>
--FILE--
<?php
require_once __DIR__ . '/inc/mysql_pdo_test.inc';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated, but what is the agreed upon policy when it comes to indentation in test files? Did we agree on any standard?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea. From a quick glance it seems the pdo_mysql* tests use indented FILE sections while the bug tests don't...

Encoding name			Returns != 0 when
-----------------------------   --------------------------------------
check_mb_big5			0xA1 <= c <= 0xF9
check_mb_cp932			0x81 <= c <= 0x9F || 0xE0 <= c <= 0xFC
check_mb_eucjpms		c >= 0x80
check_mb_euckr			c >= 0x80
check_mb_gb2312			0xA1 <= c <= 0xF7
check_mb_gbk			0x81 <= c <= 0xFE
check_mb_sjis			0x81 <= c <= 0x9F || 0xE0 <= c <= 0xFC
check_mb_ucs2			always returns length 2
check_mb_ujis			c >= 0x80
check_mb_utf16			complicated
check_mb_utf32			always returns length 4
check_mb_utf8_valid		c >= 0x80
check_mb_utf8mb3_valid		c >= 0x80
my_ismbchar_gb18030		0x81 <= c <= 0xFE

The ASCII-compatible encodings, i.e. cases where the c >= 0x80 check is
sufficient, have the minimum char length == 1.
Same reasoning as for the validity check.
We allocate twice the input length, and every input character results in
either 1 or 2 output bytes, so we cannot overflow.
@nielsdos
Copy link
Member Author

nielsdos commented Feb 23, 2024

Test committed here: 25dbe53
Rebased and cleaned up this PR.

@kamil-tekiela
Copy link
Member

Please correct me if I am wrong. The optimization assumes that if the byte is within the range 0-127 and the character set is multibyte with minimum length of 1 character, then it is safe to treat this byte as ASCII?

Since we always check the first byte of the multibyte sequence, csets such as SJIS are not a problem. https://en.wikipedia.org/wiki/Shift_JIS#Structure

The problem would only be if a cset uses one of the 128 bytes as a starting byte in a multibyte sequence. For example, ISO-2022-JP would cause a problem, but afaik MySQL doesn't support it.

@nielsdos
Copy link
Member Author

That is indeed correct.
The commit itself has a commit description where I checked whether that holds for all encodings.

Copy link
Member

@kamil-tekiela kamil-tekiela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it is ok to merge. I really hope that we haven't missed anything as this is a very important function in terms of security.

@nielsdos
Copy link
Member Author

I guess it is ok to merge. I really hope that we haven't missed anything as this is a very important function in terms of security.

The overflow check -> assert change should be fine because each byte can at most output 2 bytes, and the truncation change seems safe too.
Indeed the change to avoid the call to mb_valid is trickier. I could fuzz it for a while, i.e. write a fuzzer harness that tries to find inputs that result in a different output when given to the unoptimized vs optimized version.

@staabm
Copy link
Contributor

staabm commented Mar 4, 2024

thanks again for working on my issue.

The overflow check -> assert change should be fine because each byte can at most output 2 bytes, and the truncation change seems safe too.

would it make sense to merge the commits which are less risky, so some low hanging fruits are not blocked on finding time to write the fuzzer?

@kamil-tekiela
Copy link
Member

@staabm I don't think the other commits would improve performance. It's only that one commit that is risky because it skips the check for character sets that should be safe.

On another note, libmysql is much faster at performing this check than mysqlnd. I don't know what they do differently, but I observed almost 2x difference.

@nielsdos
Copy link
Member Author

nielsdos commented Mar 4, 2024

I noticed that the function is very sensitive to any modification, particularly for register spills and possibly CPU performance tricks.
Removing the overflow checks was only beneficial after also guarding the valid check.
I'll look into fuzzing this as a whole sometime soon.

@nielsdos
Copy link
Member Author

nielsdos commented Mar 4, 2024

I don't see libmysqlclient doing much special in comparison to us.
I'm going to think a bit about possible compiler tricks to reduce the impact of the indirect call instead of using a potential dangerous patch. I think it's possible because the compiler generates pessimistic code because the call target "could go to anywhere" in the eyes of the compiler. We know where it may go though, and by somehow telling that to the compiler we may be able to get interprocedural optimizations and optimistic code.

nielsdos added 2 commits March 6, 2024 20:42
By using an enum, and a switch table (which will be efficiently compiled
into a jump table), we can avoid the pessimistic code generation of the
indirect calls.

With this I get the following runtime for the test script in phpGH-13466 on
my i7-4790, which is around 1.25x faster.
  Time (mean ± σ):     250.9 ms ±   1.6 ms    [User: 248.4 ms, System: 2.0 ms]
  Range (min … max):   248.9 ms … 254.4 ms    11 runs
…ibyte sequence

Almost every character set can be given a number N such that a multibyte
sequence starts with a byte higher than that number N. This allows us to
skip a lot of work. To ensure the correctness of this, a sanity check is
implemented that exhaustively tries every 4-byte sequence for every
character set and checks for consistency issues.

This finally gives:
  Time (mean ± σ):     120.2 ms ±   1.2 ms    [User: 116.9 ms, System: 2.8 ms]
  Range (min … max):   118.0 ms … 122.9 ms    24 runs
@nielsdos
Copy link
Member Author

nielsdos commented Mar 6, 2024

I've pushed two commits:

  • The first one undoes the dangerous optimization; and replaces it with a purely compiler-trick-based optimization. This results in a runtime average of 250.9 ms which is 1.25x faster than the original. The idea here is to remove all the virtual calls and replace it by a jump table (due to the switch), resulting in much better code generation. Instead of storing function pointers in the charset data, I store an enum member. A dispatch function will then call the right helper based on what the enum value was.
  • The second one retries the dangerous optimization, but in a safer way. To make sure I introduced no bugs I have added an (optional) sanity check that exhaustively tests inputs to check for inconsistencies. The idea here is to store for each charset what the lowest number is that a multibyte sequence can start with. For typical ASCII-compatible charsets this is 0x80, and for some others it is even higher than that (i.e. 0xA1). For complex charsets, or charsets that always emit more than one byte it is 0x00 (because there's always a multibyte sequence). We use this trick to skip the valid & mbcharlen calls whereever possible. This is the biggest save of the two commits.

These two combined result in a runtime even better than the original PR had:

  Time (mean ± σ):     120.2 ms ±   1.2 ms    [User: 116.9 ms, System: 2.8 ms]
  Range (min … max):   118.0 ms … 122.9 ms    24 runs

@nielsdos nielsdos requested a review from kamil-tekiela March 6, 2024 21:20
#define ENUMERATOR_DISPATCH(x) case x##_id: return x(c);
ENUMERATE_ENCODINGS_CHARLEN(ENUMERATOR_DISPATCH)
#undef ENUMERATOR_DISPATCH
default: return mysqlnd_mbcharlen_null(c);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this appear twice in this switch? In the first entry in enum and in the default clause?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need a default return so that the function returns something on all paths (otherwise it can be undefined behaviour). As such, I just put the null case as the default as well. All of this is optimized by the compiler so the duplicate case doesn't matter performance-wise.

@@ -87,6 +87,11 @@ PHPAPI void mysqlnd_library_init(void)
mysqlnd_register_builtin_authentication_plugins();

mysqlnd_reverse_api_init();

#if MYSQLND_CHARSETS_SANITY_CHECK == 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine, but I assume we have no way of incorporating this into the build process, right? It can only be run manually by the developer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I don't think we can incorporate this, it would also be very slow to do because it does exhaustive testing.
We have something similar with mbstring where there's an optional sanity check too that the developer has to manually enable.

Copy link
Member

@kamil-tekiela kamil-tekiela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think this should be safe.

nielsdos added a commit that referenced this pull request Mar 7, 2024
…overhead

We allocate twice the input length, and every input character results in
either 1 or 2 output bytes, so we cannot overflow.

By using an enum, and a switch table (which will be efficiently compiled
into a jump table), we can avoid the pessimistic code generation of the
indirect calls.

With this I get the following runtime for the test script in GH-13466 on
my i7-4790, which is around 1.25x faster.
  Time (mean ± σ):     250.9 ms ±   1.6 ms    [User: 248.4 ms, System: 2.0 ms]
  Range (min … max):   248.9 ms … 254.4 ms    11 runs
@nielsdos nielsdos closed this in 3ddd341 Mar 7, 2024
@nielsdos
Copy link
Member Author

nielsdos commented Mar 7, 2024

Thanks Kamil. Cleaned up with a rebase and merged.

@staabm
Copy link
Contributor

staabm commented Mar 7, 2024

once again - thanks for everyone involved. <3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PDO quote bottleneck
5 participants