feat: add benchmarks for XML deserialization #605

Merged: ianbotsf merged 3 commits into feat-kmp-xml from xml-deserializer-benchmarks on Mar 21, 2022

ianbotsf
Contributor

Issue #

Addresses aws-sdk-kotlin#538

Description of changes

This change adds benchmarks for XML lexing and deserialization.

  • As part of running the benchmarks and comparing the results with the previous XmlPull implementation, several optimizations were made to the lexer. (Unfortunately, some of these changes hamper the readability and separation of that part of the code. Suggestions welcome on how to address that.)
  • Now that benchmarks show the new XML lexer performing roughly on par with XmlPull, the XmlPull deserializer is removed.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ianbotsf ianbotsf requested a review from a team as a code owner March 17, 2022 19:16
@ianbotsf ianbotsf requested a review from aajtodd March 17, 2022 19:16
Contributor

@aajtodd aajtodd left a comment

Looks great overall. A couple of minor suggestions.

fun peekAtMost(length: Int): String {
val actualLength = min(length, end - offset)
return sliceByLength(actualLength)
private fun checkBounds(length: Int, errCondition: String) {
Contributor

you could probably mark this as inline

Contributor Author

I can do that but there's no significant difference in benchmark speed either way.

Contributor

then no point inlining

'\u2fef' < c && c < '\u3001' ||
'\ud7ff' < c
) {
error("Unable to find valid XML start name character")
Contributor

may be useful to add the character that was found
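
One way this suggestion could look, as a minimal sketch: `describeChar` and `invalidStartNameChar` are hypothetical helpers that do not appear in this PR, and the `offset` parameter is an assumption about what the lexer has on hand at the point of failure. Printing the code point alongside the character keeps whitespace and control characters visible in the message.

```kotlin
// Hypothetical helper (not part of this PR): renders a char together with its
// code point so that invisible characters still show up in diagnostics.
fun describeChar(c: Char): String =
    "'$c' (U+${c.code.toString(16).uppercase().padStart(4, '0')})"

// Hypothetical wrapper around the existing error message; `offset` is assumed
// to be the lexer's current position in the source.
fun invalidStartNameChar(c: Char, offset: Int): Nothing =
    error("Unable to find valid XML start name character; found ${describeChar(c)} at offset $offset")
```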

private fun Char.toRange() = this..this

// https://www.w3.org/TR/xml/#NT-Name
private val nameStartCharRanges = setOf(
Contributor

interesting. I would not have expected these to be a bottleneck. Certainly easier to read and map to the spec this way.

Contributor Author

Agreed, it was more readable before. I think the bottleneck arises because neither the JVM stdlib nor the Kotlin stdlib special-cases small sets or sets of scalars. The previous code used a full linked hash set with Char boxing/unboxing to perform the lookups, which was slower than a tight if condition directly on scalars.
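
For readers following along, here is a rough sketch of the two styles being compared. It covers only a small subset of the NameStartChar ranges from the spec, the function names are hypothetical, and the set-based lookup is an assumption about how the earlier code consulted the set rather than a copy of it.

```kotlin
// Subset of https://www.w3.org/TR/xml/#NT-Name, for illustration only.
val nameStartCharRanges: Set<CharRange> = setOf(
    ':'..':',
    'A'..'Z',
    '_'..'_',
    'a'..'z',
    '\u00c0'..'\u00d6',
)

// Set-based style (hypothetical name): maps cleanly onto the spec, but every
// lookup walks a collection of range objects.
fun isNameStartViaSet(c: Char): Boolean = nameStartCharRanges.any { c in it }

// Direct style (hypothetical name): plain comparisons on the primitive Char,
// with the most common ASCII letters checked first.
fun isNameStartViaComparisons(c: Char): Boolean =
    ('a' <= c && c <= 'z') ||
        ('A' <= c && c <= 'Z') ||
        c == ':' ||
        c == '_' ||
        ('\u00c0' <= c && c <= '\u00d6')
```

Both functions accept the same characters; any speed difference should be confirmed with the benchmarks rather than assumed from this sketch.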

var peekOffset = offset + 1
while (peekOffset < end) {
val ch = source[peekOffset]
if (
Contributor

@aajtodd aajtodd Mar 18, 2022

The multiline if statement is fine. Though it may be cleaner to look for the valid character ranges rather than the invalid ones. I'd also expect that most XML we get is going to end a name by finding the end of the tag > (or a space indicating start of an attribute).

As currently structured, every character is tested against each invalid range before it is accepted. It may be faster to test for valid characters first (prioritizing ASCII). In other words, right now we end up evaluating every branch just to prove that a character isn't an invalid name char.

The opposite may be quicker as we expect in most cases to find valid chars (especially given that the XML names come from smithy shape names which is a restricted character set anyway)

when(val ch = source[peekOffset]) {
    in 'a'..'z',
    in 'A'..'Z',
    ...,
    in  '\u203f'..'\u2040' -> { peekOffset++; continue }
    else -> error(...)
}

You could even prioritize the common branches for how a name will end:

when (val ch = source[peekOffset++]) {
     in 'a'..'z', in 'A'..'Z' -> continue
     ' ', '>' -> break
     // other valid cases
     else -> invalid
}

Contributor Author

You're right, optimizing for valid and expected characters first improves the performance. Using when or range checks (e.g., c in 'A'..'Z') actually hurts performance so I'll skip those (although they are more readable).
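
A minimal sketch of that approach, assuming the character at `start` has already been validated as a name start character. `findNameEnd` is a hypothetical name, and only a subset of the NameChar ranges is checked. The common ASCII cases come first and are written as plain comparisons instead of `when` or `in` range checks.

```kotlin
// Hypothetical sketch: returns the index just past the end of the XML name
// that begins at `start`. Only a subset of the spec's NameChar ranges is
// covered; the remaining ranges would follow the same pattern.
fun findNameEnd(source: String, start: Int): Int {
    var peekOffset = start + 1
    while (peekOffset < source.length) {
        val ch = source[peekOffset]
        if (
            // Most common valid characters first, as plain comparisons.
            ('a' <= ch && ch <= 'z') ||
            ('A' <= ch && ch <= 'Z') ||
            ('0' <= ch && ch <= '9') ||
            ch == ':' || ch == '-' || ch == '.' || ch == '_' ||
            // Rarer valid characters last (subset only).
            ch == '\u00b7' ||
            ('\u00c0' <= ch && ch <= '\u00d6')
        ) {
            peekOffset++
        } else {
            // Anything else ends the name; the caller decides whether the
            // terminator (for example '>' or a space) is legal in context.
            break
        }
    }
    return peekOffset
}
```

With typical names drawn from a restricted character set, most characters are accepted by the first two comparisons, so the later branches are rarely evaluated.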

@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 3 (rating A)

Coverage: no information
Duplication: 0.0%

@ianbotsf ianbotsf merged commit dec805b into feat-kmp-xml Mar 21, 2022
@ianbotsf ianbotsf deleted the xml-deserializer-benchmarks branch March 21, 2022 17:02
2 participants