Improve MySQLnd quote escaping performance #13466
ext/mysqlnd/mysqlnd_charset.c
Outdated
/* check unicode characters
 * Encodings that have a minimum length of 1 are compatible with ASCII.
 * So we can skip (for performance reasons) the check to mb_valid for them. */
if (cset->char_maxlen > 1 && (*((zend_uchar *) escapestr) > 0x80 || cset->char_minlen > 1) && (len = cset->mb_valid(escapestr, end))) {
Does it make sense to check for the UTF-8 flag on zend_strings or not?
Good idea, but we also have to check that the charset is utf8mb4. It would require breaking the public API though, as mysqlnd_cset_escape_quotes does not take zend_string*. I'm not sure what impact this would have on 3rd party drivers.
So TIL that we expose mysqlnd driver functions. Let's keep this suggestion for a follow-up PR.
ext/mysqlnd/mysqlnd_charset.c
Outdated
if (cset->char_maxlen > 1 && (*((zend_uchar *) escapestr) > 0x80 || cset->char_minlen > 1)) {
unsigned int len = cset->mb_valid(escapestr, end);
It looks like the check could be skipped in the same way in mysqlnd_cset_escape_quotes. Or would that not make much of a difference?
You are right, I factored out the check so I could reuse it for this function.
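A minimal sketch of what such a factored-out check could look like. The struct and the names `toy_charset`, `toy_gbk_valid`, `toy_needs_mb_check` and `toy_mb_len` are hypothetical stand-ins for illustration, not the mysqlnd API. The sketch conservatively treats every byte >= 0x80 as a possible lead byte, which is an assumption and differs slightly from the diff under review.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a charset descriptor. */
typedef struct toy_charset {
	unsigned int char_minlen;
	unsigned int char_maxlen;
	/* Returns the byte length of a valid multibyte character at start, or 0. */
	unsigned int (*mb_valid)(const char *start, const char *end);
} toy_charset;

/* Toy GBK validator: lead byte 0x81..0xFE, trail byte 0x40..0x7E or 0x80..0xFE. */
static unsigned int toy_gbk_valid(const char *start, const char *end)
{
	if (end - start < 2) return 0;
	unsigned char head = (unsigned char) start[0];
	unsigned char tail = (unsigned char) start[1];
	bool head_ok = head >= 0x81 && head <= 0xFE;
	bool tail_ok = (tail >= 0x40 && tail <= 0x7E) || (tail >= 0x80 && tail <= 0xFE);
	return (head_ok && tail_ok) ? 2 : 0;
}

/* Decides whether the (comparatively expensive) mb_valid call is needed at all. */
static bool toy_needs_mb_check(const toy_charset *cset, const char *p)
{
	if (cset->char_maxlen <= 1) return false; /* single-byte charset */
	if (cset->char_minlen > 1) return true;   /* not ASCII compatible, always check */
	return (unsigned char) *p >= 0x80;        /* ASCII bytes never start a sequence */
}

/* Shared helper that both escaping functions could call. */
static unsigned int toy_mb_len(const toy_charset *cset, const char *p, const char *end)
{
	assert(p < end);
	return toy_needs_mb_check(cset, p) ? cset->mb_valid(p, end) : 0;
}
```

With a descriptor such as `{ 1, 2, toy_gbk_valid }`, plain ASCII input never reaches the validator, which is where the speed-up in this sketch comes from.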
really cool, thank you sooo much. do you think we can/could/should do similar things for ?
Probably the PR's name is a bit misleading. If PDO is using mysqlnd then it's the same code. If PDO uses libmysql then we have no control over how the escaping function works.
You're right, renamed.
@nielsdos I am trying to understand the old code as well as your changes. I have to admit I am having a hard time. The mysqlnd code is not well documented and the comments and method names are vague. However, it was all based on the libmysql code, so I am checking it to see how it's supposed to work. For example, https://github.com/mysql/mysql-server/blob/trunk/mysys/charset.cc#L439 but I still do not understand how it works. Maybe it will help you? See also SQL injection that gets around mysql_real_escape_string() for an explanation of how this function prevents this attack.
One thing I just cannot understand. In the function declaration it says:
but we are not passing an integer to it, right? How does that work?
So it looks like we can hit that, but we don't actually seem to test that in PHP's test suite. I read the code you linked and I think it clicked now. I'm going to reuse their example, but I'll try to explain it with more words why the character length check is there. Assume we are using the GBK encoding.
I think this is just an API oddity. Looking at the mysql code: this passes only the current byte to
Thank you. Your explanation actually made it clear to me. I tested right now and it works exactly how you described.
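For readers without the full comment at hand, here is a small sketch of the commonly cited GBK case, written under stated assumptions: 0xBF 0x27 is not a valid GBK character, but 0xBF 0x5C is. A byte-wise escaper therefore turns the invalid pair into a valid character followed by a bare quote. All function names below are hypothetical, and the parser is a toy model, not MySQL's lexer.

```c
#include <assert.h>
#include <stddef.h>

static int gbk_is_head(unsigned char c) { return c >= 0x81 && c <= 0xFE; }
static int gbk_is_tail(unsigned char c) { return (c >= 0x40 && c <= 0x7E) || (c >= 0x80 && c <= 0xFE); }

/* Naive escaping: inspects each byte in isolation. out must hold 2 * len bytes. */
static size_t naive_escape(const unsigned char *in, size_t len, unsigned char *out)
{
	size_t n = 0;
	for (size_t i = 0; i < len; i++) {
		if (in[i] == '\'' || in[i] == '\\') out[n++] = '\\';
		out[n++] = in[i];
	}
	return n;
}

/* GBK-aware escaping: valid pairs are copied untouched, a dangling lead byte
 * is escaped itself so it cannot swallow the backslash that follows it. */
static size_t gbk_aware_escape(const unsigned char *in, size_t len, unsigned char *out)
{
	size_t n = 0;
	for (size_t i = 0; i < len; i++) {
		if (gbk_is_head(in[i])) {
			if (i + 1 < len && gbk_is_tail(in[i + 1])) {
				out[n++] = in[i++]; /* copy lead byte */
				out[n++] = in[i];   /* copy trail byte */
				continue;
			}
			out[n++] = '\\';        /* looks like a lead byte but is not part of a valid pair */
		} else if (in[i] == '\'' || in[i] == '\\') {
			out[n++] = '\\';
		}
		out[n++] = in[i];
	}
	return n;
}

/* Toy parser: counts quotes that a GBK-aware reader sees as unescaped. */
static size_t gbk_unescaped_quotes(const unsigned char *s, size_t len)
{
	size_t count = 0;
	for (size_t i = 0; i < len; ) {
		if (i + 1 < len && gbk_is_head(s[i]) && gbk_is_tail(s[i + 1])) { i += 2; continue; }
		if (s[i] == '\\' && i + 1 < len) { i += 2; continue; } /* backslash plus escaped byte */
		if (s[i] == '\'') count++;
		i++;
	}
	return count;
}
```

Feeding the two bytes 0xBF 0x27 through `naive_escape` yields 0xBF 0x5C 0x27, in which the toy parser finds one unescaped quote. `gbk_aware_escape` yields 0x5C 0xBF 0x5C 0x27 and leaves none.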
Could you add a test for the GBK example you provided? I don't think you would need a proper database connection just to initialize a PdoMySQL instance.
Actually AFAIK we do. Anyway, test added.
Could you rebase the unit test onto master once the tests pass, please? It should be committed regardless of this PR. Also maybe change the description of the test to reflect that it tests escaping with GBK.
Sure, I'll do that.
?>
--FILE--
<?php
require_once __DIR__ . '/inc/mysql_pdo_test.inc';
Unrelated, but what is the agreed upon policy when it comes to indentation in test files? Did we agree on any standard?
No idea. From a quick glance it seems the pdo_mysql* tests use indented FILE sections while the bug tests don't...
Encoding name                  Returns != 0 when
-----------------------------  --------------------------------------
check_mb_big5                  0xA1 <= c <= 0xF9
check_mb_cp932                 0x81 <= c <= 0x9F || 0xE0 <= c <= 0xFC
check_mb_eucjpms               c >= 0x80
check_mb_euckr                 c >= 0x80
check_mb_gb2312                0xA1 <= c <= 0xF7
check_mb_gbk                   0x81 <= c <= 0xFE
check_mb_sjis                  0x81 <= c <= 0x9F || 0xE0 <= c <= 0xFC
check_mb_ucs2                  always returns length 2
check_mb_ujis                  c >= 0x80
check_mb_utf16                 complicated
check_mb_utf32                 always returns length 4
check_mb_utf8_valid            c >= 0x80
check_mb_utf8mb3_valid         c >= 0x80
my_ismbchar_gb18030            0x81 <= c <= 0xFE

The ASCII-compatible encodings, i.e. cases where the c >= 0x80 check is sufficient, have the minimum char length == 1.
Same reasoning as for the validity check.
We allocate twice the input length, and every input character results in either 1 or 2 output bytes, so we cannot overflow.
f754350 to c754be8
Test committed here: 25dbe53
Please correct me if I am wrong. The optimization assumes that if the byte is within the range 0-127 and the character set is multibyte with minimum length of 1 character, then it is safe to treat this byte as ASCII? Since we always check the first byte of the multibyte sequence, csets such as SJIS are not a problem. https://en.wikipedia.org/wiki/Shift_JIS#Structure The problem would only be if a cset uses one of the 128 bytes as a starting byte in a multibyte sequence. For example, ISO-2022-JP would cause a problem, but afaik MySQL doesn't support it.
That is indeed correct.
I guess it is ok to merge. I really hope that we haven't missed anything as this is a very important function in terms of security.
The overflow check -> assert change should be fine because each input byte produces at most 2 output bytes, and the truncation change seems safe too.
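The bound argument can be sketched as follows, assuming a simplified escaper that only doubles single quotes. `toy_escape_quotes` is a hypothetical name and the allocation uses plain `malloc` rather than PHP's allocator. Because the buffer holds twice the input length, the assertion documents an invariant instead of guarding a reachable overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Doubles every single quote. Returns a malloc'd, NUL-terminated buffer
 * (caller frees) or NULL on allocation failure; *out_len receives the length. */
static char *toy_escape_quotes(const char *in, size_t len, size_t *out_len)
{
	if (len > (SIZE_MAX - 1) / 2) return NULL; /* 2 * len + 1 would wrap around */

	size_t cap = 2 * len;                      /* worst case: every byte is a quote */
	char *out = malloc(cap + 1);
	size_t n = 0;
	if (out == NULL) return NULL;

	for (size_t i = 0; i < len; i++) {
		if (in[i] == '\'') out[n++] = '\''; /* at most one extra byte per input byte */
		out[n++] = in[i];
		assert(n <= cap);                   /* invariant: n <= 2 * (i + 1) <= cap */
	}
	out[n] = '\0';
	*out_len = n;
	return out;
}
```

Even an input made only of quotes fills the buffer exactly and never exceeds it, which is why a runtime check on every iteration adds cost without adding safety in this sketch.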
thanks again for working on my issue.
would it make sense to merge the commits which are less risky, so some low hanging fruits are not blocked on finding time to write the fuzzer?
@staabm I don't think the other commits would improve performance. It's only that one commit that is risky because it skips the check for character sets that should be safe. On another note, libmysql is much faster at performing this check than mysqlnd. I don't know what they do differently, but I observed almost 2x difference.
I noticed that the function is very sensitive to any modification, particularly for register spills and possibly CPU performance tricks.
I don't see libmysqlclient doing much special in comparison to us.
By using an enum, and a switch table (which will be efficiently compiled into a jump table), we can avoid the pessimistic code generation of the indirect calls. With this I get the following runtime for the test script in GH-13466 on my i7-4790, which is around 1.25x faster.
Time (mean ± σ): 250.9 ms ± 1.6 ms [User: 248.4 ms, System: 2.0 ms]
Range (min … max): 248.9 ms … 254.4 ms 11 runs
…ibyte sequence
Almost every character set can be given a number N such that a multibyte sequence starts with a byte higher than that number N. This allows us to skip a lot of work. To ensure the correctness of this, a sanity check is implemented that exhaustively tries every 4-byte sequence for every character set and checks for consistency issues. This finally gives:
Time (mean ± σ): 120.2 ms ± 1.2 ms [User: 116.9 ms, System: 2.8 ms]
Range (min … max): 118.0 ms … 122.9 ms 24 runs
I've pushed two commits:
These two combined result in a runtime even better than the original PR had:
#define ENUMERATOR_DISPATCH(x) case x##_id: return x(c);
ENUMERATE_ENCODINGS_CHARLEN(ENUMERATOR_DISPATCH)
#undef ENUMERATOR_DISPATCH
default: return mysqlnd_mbcharlen_null(c);
Doesn't this appear twice in this switch? In the first entry in enum and in the default clause?
We also need a default return so that the function returns something on all paths (otherwise it can be undefined behaviour). As such, I just put the null case as the default as well. All of this is optimized by the compiler so the duplicate case doesn't matter performance-wise.
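The pattern under discussion can be sketched in isolation like this. Every identifier prefixed with `toy_` or `TOY_` is hypothetical and only mirrors the shape of the quoted macro; the three length functions are simplified placeholders. One X-macro list generates both the enum and the `switch` cases, and the `default` branch exists so that every path returns a value.

```c
#include <assert.h>

/* Placeholder per-encoding length functions (hypothetical, simplified). */
static unsigned int toy_charlen_null(unsigned int c) { (void) c; return 1; }
static unsigned int toy_charlen_gbk(unsigned int c)  { return (c >= 0x81 && c <= 0xFE) ? 2 : 1; }
static unsigned int toy_charlen_ucs2(unsigned int c) { (void) c; return 2; }

/* Single source of truth: the list of dispatchable functions. */
#define TOY_ENUMERATE_CHARLEN(X) \
	X(toy_charlen_null) \
	X(toy_charlen_gbk) \
	X(toy_charlen_ucs2)

/* Expand the list into enum constants: toy_charlen_null_id, toy_charlen_gbk_id, ... */
#define TOY_ENUM_ENTRY(x) x##_id,
typedef enum {
	TOY_ENUMERATE_CHARLEN(TOY_ENUM_ENTRY)
	toy_charlen_count
} toy_charlen_id;
#undef TOY_ENUM_ENTRY

/* Expand the same list into direct calls inside a switch, replacing the
 * function-pointer call that the compiler could not see through. */
static unsigned int toy_charlen_dispatch(toy_charlen_id id, unsigned int c)
{
	switch (id) {
#define TOY_DISPATCH(x) case x##_id: return x(c);
		TOY_ENUMERATE_CHARLEN(TOY_DISPATCH)
#undef TOY_DISPATCH
		default: return toy_charlen_null(c); /* keeps every path returning a value */
	}
}
```

In this sketch the null function is reachable both through its own case label and through `default`, matching the duplication the review comment points out. Both routes run the same code, so the overlap is harmless.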
@@ -87,6 +87,11 @@ PHPAPI void mysqlnd_library_init(void)
mysqlnd_register_builtin_authentication_plugins();

mysqlnd_reverse_api_init();

#if MYSQLND_CHARSETS_SANITY_CHECK == 1
This seems fine, but I assume we have no way of incorporating this into the build process, right? It can only be run manually by the developer.
Yeah I don't think we can incorporate this, it would also be very slow to do because it does exhaustive testing.
We have something similar with mbstring where there's an optional sanity check too that the developer has to manually enable.
Ok, I think this should be safe.
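To illustrate the idea behind such an exhaustive check, here is a scaled-down sketch. It assumes a toy GBK validator and enumerates every 2-byte sequence, whereas the PR describes 4-byte sequences across all character sets. The names `toy_valid_fn`, `toy_gbk_valid` and `toy_threshold_is_consistent` are hypothetical. The `threshold` parameter is the lowest byte allowed to start a multibyte sequence, i.e. N + 1 in the commit message's wording.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int (*toy_valid_fn)(const unsigned char *start, const unsigned char *end);

/* Toy GBK validator: lead byte 0x81..0xFE, trail byte 0x40..0x7E or 0x80..0xFE. */
static unsigned int toy_gbk_valid(const unsigned char *s, const unsigned char *end)
{
	if (end - s < 2) return 0;
	bool head = s[0] >= 0x81 && s[0] <= 0xFE;
	bool tail = (s[1] >= 0x40 && s[1] <= 0x7E) || (s[1] >= 0x80 && s[1] <= 0xFE);
	return (head && tail) ? 2 : 0;
}

/* Returns true if no 2-byte sequence whose first byte lies below threshold is
 * reported as a multibyte character. If this holds, skipping the validator
 * for first bytes below threshold cannot change any result. */
static bool toy_threshold_is_consistent(toy_valid_fn valid, unsigned int threshold)
{
	unsigned char seq[2];
	for (unsigned int a = 0; a < 256; a++) {
		for (unsigned int b = 0; b < 256; b++) {
			seq[0] = (unsigned char) a;
			seq[1] = (unsigned char) b;
			if (a < threshold && valid(seq, seq + 2) != 0) {
				return false; /* the shortcut would have skipped a real character */
			}
		}
	}
	return true;
}
```

For the toy validator, a threshold of 0x81 passes while 0x82 fails, because 0x81 0x40 is a valid pair that the shortcut would then miss. The 2-byte space has only 65536 entries; the 4-byte space has about 4.3 billion per character set, which fits the comment above that running the check on every build would be very slow.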
Thanks Kamil. Cleaned up with a rebase and merged.
once again - thanks to everyone involved. <3
Closes GH-13440
Commits are split for easier reviewing, and I recommend reviewing them in order, one by one.
I would commit them one by one separately upon approval.
For this script: