-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add benchmarks for XML deserialization #605
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great overall. Couple minor suggestions
fun peekAtMost(length: Int): String { | ||
val actualLength = min(length, end - offset) | ||
return sliceByLength(actualLength) | ||
private fun checkBounds(length: Int, errCondition: String) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could probably mark this as inline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do that but there's no significant difference in benchmark speed either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then no point inlining
'\u2fef' < c && c < '\u3001' || | ||
'\ud7ff' < c | ||
) { | ||
error("Unable to find valid XML start name character") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be useful to add the character that was found
private fun Char.toRange() = this..this | ||
|
||
// https://www.w3.org/TR/xml/#NT-Name | ||
private val nameStartCharRanges = setOf( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting. I would not have expected these to be a bottleneck. Certainly easier to read and map to the spec this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, it was more readable before. I think the bottleneck is because neither the JVM stdlib nor Kotlin stdlib special-cases small set sizes nor scalar sets. This used a full linked hashset implementation with Char boxing/unboxing to perform the lookups, which was slower than a tight if
condition directly on scalars.
var peekOffset = offset + 1 | ||
while (peekOffset < end) { | ||
val ch = source[peekOffset] | ||
if ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The multiline if statement is fine. Though it may be cleaner to look for the valid character ranges rather than the invalid ones. I'd also expect that most XML we get is going to end a name by finding the end of the tag >
(or a space indicating start of an attribute).
The way this is currently structured you actually end up checking until you hit an invalid character. It may be faster to look for valid characters (prioritizing ascii) first. In other words right now we end up checking every branch to prove that a character isn't an invalid name char.
The opposite may be quicker as we expect in most cases to find valid chars (especially given that the XML names come from smithy shape names which is a restricted character set anyway)
when(val ch = source[peekOffset]) {
in 'a'..'z',
in 'A'..'Z',
...,
in '\u203f'..'\u2040' -> { peekOffset++; continue }
else -> error(...)
}
You could even prioritize the common branches for how a name will end:
when val ch = source[peekOffset++]) {
in 'a'..'z', in 'A'..'Z' -> continue
' ', '>' -> break
// other valid cases
else -> invalid
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, optimizing for valid and expected characters first improves the performance. Using when
or range checks (e.g., c in 'A'..'Z'
) actually hurts performance so I'll skip those (although they are more readable).
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information |
Issue #
Addresses aws-sdk-kotlin#538
Description of changes
This change adds benchmarks for XML lexing and deserialization.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.